AI crawlers are scraping sites at scale and making them unreachable (solutions included)

孤僻成性3p · posted on 2025-5-9 10:35:59

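Crawlers from AI tools were scraping my site so aggressively that pages just kept loading and the site could not be reached normally. (The logs showed it had been scraped for a long time; I only hadn't noticed because the site had never actually gone down.) While searching for a fix I found this post and followed its steps, and I also asked Doubao AI to analyze the problem. I ended up combining the post's approach with Doubao's suggestions, and I'm sharing the result here.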

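Plan 1: First, I filtered AI crawlers by User-Agent with the free Nginx firewall plugin for the BaoTa panel (宝塔面板), following the reference link that @雨天榕树 shared in the comments: https://www.52txr.cn/2025/banaicurl.html
The list below also includes crawlers I added for my own situation.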

(Scrapy|AwarioBot|AI2Bot|Ai2Bot-Dolma|aiHitBot|anthropic-ai|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|PetalBot|SemrushBot-OCOB|TikTokSpider|VelenPublicWebCrawler|YouBot)
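
Before pasting a User-Agent filter like this into the firewall, it can help to try the alternation against a few sample strings. The sketch below uses Python rather than the plugin's own matcher. The shortened token list, the case-insensitive flag, the `is_blocked` helper, and the sample User-Agent strings are all assumptions made for illustration.

```python
import re

# A shortened version of the alternation, with "|" between crawler names.
AI_BOT_PATTERN = re.compile(
    r"(Scrapy|AwarioBot|AI2Bot|ClaudeBot|GPTBot|PerplexityBot|YouBot)",
    re.IGNORECASE,  # assumption: the firewall matches case-insensitively
)


def is_blocked(user_agent: str) -> bool:
    """Return True when any listed crawler token appears in the User-Agent."""
    return AI_BOT_PATTERN.search(user_agent) is not None


# Hypothetical sample User-Agent strings, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
]

for ua in samples:
    print(is_blocked(ua), ua)
```

A substring filter like this only catches crawlers that announce themselves; anything that sends a browser-like User-Agent passes straight through.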

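Then I used robots.txt to block AI crawlers and to slow down Baidu's crawling:

# Baiduspider: allowed, but with a delay between fetches (Crawl-delay is a non-standard directive, and not every crawler honors it)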
User-Agent: Baiduspider
Crawl-delay: 5
# AI crawlers and special-purpose tools: deny access to the whole site
User-Agent: Scrapy
Disallow: /
User-Agent: AwarioBot
Disallow: /
User-agent: SemrushBot-BA
Disallow: /
User-agent: SemrushBot-SI
Disallow: /
User-agent: SemrushBot-SWA
Disallow: /
User-agent: SplitSignalBot
Disallow: /
User-agent: SemrushBot-OCOB
Disallow: /
User-agent: SemrushBot-FT
Disallow: /
User-Agent: AI2Bot
Disallow: /
User-Agent: Ai2Bot-Dolma
Disallow: /
User-Agent: aiHitBot
Disallow: /
User-Agent: Amazonbot
Disallow: /
User-Agent: anthropic-ai
Disallow: /
User-Agent: Applebot
Disallow: /
User-Agent: Applebot-Extended
Disallow: /
User-Agent: Brightbot 1.0
Disallow: /
User-Agent: Bytespider
Disallow: /
User-Agent: CCBot
Disallow: /
User-Agent: ChatGPT-User
Disallow: /
User-Agent: Claude-Web
Disallow: /
User-Agent: ClaudeBot
Disallow: /
User-Agent: cohere-ai
Disallow: /
User-Agent: cohere-training-data-crawler
Disallow: /
User-Agent: Cotoyogi
Disallow: /
User-Agent: Crawlspace
Disallow: /
User-Agent: Diffbot
Disallow: /
User-Agent: DuckAssistBot
Disallow: /
User-Agent: FacebookBot
Disallow: /
User-Agent: Factset_spyderbot
Disallow: /
User-Agent: FirecrawlAgent
Disallow: /
User-Agent: FriendlyCrawler
Disallow: /
User-Agent: Google-Extended
Disallow: /
User-Agent: GoogleOther
Disallow: /
User-Agent: GoogleOther-Image
Disallow: /
User-Agent: GoogleOther-Video
Disallow: /
User-Agent: GPTBot
Disallow: /
User-Agent: iaskspider/2.0
Disallow: /
User-Agent: ICC-Crawler
Disallow: /
User-Agent: ImagesiftBot
Disallow: /
User-Agent: img2dataset
Disallow: /
User-Agent: imgproxy
Disallow: /
User-Agent: ISSCyberRiskCrawler
Disallow: /
User-Agent: Kangaroo Bot
Disallow: /
User-Agent: Meta-ExternalAgent
Disallow: /
User-Agent: Meta-ExternalFetcher
Disallow: /
User-Agent: NovaAct
Disallow: /
User-Agent: OAI-SearchBot
Disallow: /
User-Agent: omgili
Disallow: /
User-Agent: omgilibot
Disallow: /
User-Agent: Operator
Disallow: /
User-Agent: PanguBot
Disallow: /
User-Agent: Perplexity-User
Disallow: /
User-Agent: PerplexityBot
Disallow: /
User-Agent: PetalBot
Disallow: /
User-Agent: Sidetrade indexer bot
Disallow: /
User-Agent: TikTokSpider
Disallow: /
User-Agent: Timpibot
Disallow: /
User-Agent: VelenPublicWebCrawler
Disallow: /
User-Agent: Webzio-Extended
Disallow: /
User-Agent: YouBot
Disallow: /
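
To check how a rule set like this is read by a parser, Python's standard `urllib.robotparser` can be pointed at it. This is only a sketch: the excerpt is shortened to three groups, `allowed` is a hypothetical helper, and `example.com` stands in for the real site. It shows what a well-behaved crawler should conclude, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# A shortened excerpt of the rules; the full file is parsed the same way.
ROBOTS_TXT = """\
User-Agent: Baiduspider
Crawl-delay: 5
User-Agent: GPTBot
Disallow: /
User-Agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str = "https://example.com/index.php") -> bool:
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "ClaudeBot", "Baiduspider"):
    print(agent, allowed(agent), parser.crawl_delay(agent))
```

robots.txt is purely advisory, which fits what happened next: crawlers that ignore it are unaffected.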

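Result: the site still would not open.
Plan 2 (used together with Plan 1), suggested by Doubao: add the following to the BaoTa panel's global Nginx configuration file (inside the http { block)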

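# 1. Flag Baiduspider by User-Agent (must be inside the http block)
map $http_user_agent $is_baidu_spider {
    default 0;
    "~*Baiduspider" 1;  # matches Baiduspider's User-Agent
}
# 2. Build the limit key: the client IP for Baiduspider, empty for everyone else.
#    nginx does not account requests whose key is empty, so only Baiduspider is limited.
#    (A key of $binary_remote_addr$is_baidu_spider is never empty and would limit every visitor.)
map $is_baidu_spider $baidu_limit_key {
    default "";
    1       $binary_remote_addr;
}
# 3. Define the limit zone (caps Baiduspider's request rate)
limit_req_zone $baidu_limit_key zone=baidu_spider:10m rate=100r/m;
# rate=100r/m: at most 100 requests per minute per IP (adjust to your server's capacity)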
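
For a limit that should hit only Baiduspider, the important detail is the zone key: nginx's `limit_req_zone` does not account requests whose key evaluates to an empty string. The Python sketch below models just that keying decision. `limit_key` and `is_rate_limited_client` are hypothetical names, and the IP address is a documentation address, not a real crawler.

```python
import re

# Mirrors the "~*Baiduspider" case-insensitive match in the map block.
BAIDU_UA = re.compile(r"Baiduspider", re.IGNORECASE)


def limit_key(user_agent: str, remote_addr: str) -> str:
    """Return the rate-limit key: the client IP for Baiduspider, "" otherwise.

    An empty key models nginx skipping the request for this zone.
    """
    return remote_addr if BAIDU_UA.search(user_agent) else ""


def is_rate_limited_client(user_agent: str, remote_addr: str) -> bool:
    """A client is subject to the limit only when its key is non-empty."""
    return limit_key(user_agent, remote_addr) != ""


baidu_ua = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36"

print(is_rate_limited_client(baidu_ua, "192.0.2.10"))
print(is_rate_limited_client(browser_ua, "192.0.2.10"))
```

A key that always contains the client address never evaluates to empty, so a zone built on it applies to every visitor regardless of User-Agent.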

Then add the following to the site's own configuration (inside the server { block)

# ------------------------ Thumbnail optimization (matches the full path) ------------------------
# Match every image file under /_data/i/upload/ (including date subdirectories such as /2024/08/08/)
location ~* ^/_data/i/upload/.*\.(jpg|jpeg|png|webp|avif|heic|heif)$ {
    # Cache for 1 year (CDN and browser). "expires" already emits both Expires and
    # Cache-Control: max-age, so adding a second max-age would send conflicting values.
    expires 1y;
    add_header Cache-Control "public";
    # Turn off access logging for thumbnails (less disk I/O)
    access_log off;
    # Global hotlink protection still applies (invalid Referers are already blocked), so no extra check here
}
# ------------------------ AI crawlers and original-image protection ------------------------
# User-Agents to block (AI crawlers + malicious tools)
set $block_ua 0;
if ($http_user_agent ~* "(HTTrack|Apache-HttpClient|harvest|audit|dirbuster|pangolin|nmap|sqln|hydra|Parser|libwww|BBBike|sqlmap|w3af|owasp|Nikto|fimap|havij|zmeu|BabyKrokodil|netsparker|httperf|SF|AI2Bot|Ai2Bot-Dolma|aiHitBot|ChatGPT-User|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|Scrapy|TikTokSpider|VelenPublicWebCrawler)") {
    set $block_ua 1;
}
# Let legitimate search engines through (Baidu, Google, etc.)
if ($http_user_agent ~* "(Baiduspider|Googlebot|bingbot|YandexBot|Sogou web spider|Bytespider)") {
    set $block_ua 0;
}
# Stricter blocking for the original-image directory (/upload/): only blocked UAs are rejected, normal users are unaffected
location ~* ^/upload/ {
    if ($block_ua = 1) {
        return 403;
    }
    try_files $uri $uri/ =404;
}
# ------------------------ Rate-limit dynamic pages (intended to affect Baiduspider only) ------------------------
# The dot is escaped so the pattern matches only picture.php and index.php
location ~* ^/(picture|index)\.php {
    # Apply the limit (intended to take effect only when $is_baidu_spider = 1)
    limit_req zone=baidu_spider burst=20 nodelay;
    # Existing PHP handling (e.g. include enable-php-84.conf)
    include enable-php-84.conf;
}
# ------------------------ Other configuration ------------------------
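
To get a feel for what `rate=100r/m` with `burst=20 nodelay` means in practice, here is a simplified Python model of leaky-bucket accounting in the style nginx documents for `limit_req`. It is a sketch under stated assumptions, not nginx's implementation: the `LeakyBucket` class and its fields are hypothetical names, time is passed in by hand, and a single client key is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    rate_per_sec: float            # 100 requests/minute -> 100 / 60
    burst: int                     # requests tolerated above the steady rate
    excess: float = 0.0            # how far the client is currently above the rate
    last: Optional[float] = None   # time of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at `now` (seconds) is accepted."""
        if self.last is None:
            self.last = now
            return True  # first request: nothing accumulated yet
        elapsed = now - self.last
        # Drain the bucket at the configured rate, then add this request.
        excess = max(0.0, self.excess - self.rate_per_sec * elapsed + 1.0)
        if excess > self.burst:
            return False  # over the burst allowance: rejected, state unchanged
        self.excess, self.last = excess, now
        return True


bucket = LeakyBucket(rate_per_sec=100 / 60, burst=20)
# Thirty requests arriving at the same instant.
results = [bucket.allow(0.0) for _ in range(30)]
print(sum(results), "of", len(results), "accepted")
```

In this model a sudden batch of 30 requests gets 21 through (the first request plus the burst of 20) and the rest rejected, after which capacity returns at the steady rate of 100 per minute.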

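The thumbnail rules and the like are specific to my site; adjust them to your own setup.
Plan 3: Blacklist the IP ranges of the search engines and AI spiders (this stops the site from being indexed). Once the site has recovered, try lifting the block.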
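
Blocking by IP range comes down to a CIDR membership test. Below is a minimal sketch using Python's standard `ipaddress` module. The ranges are reserved documentation networks used as placeholders, not real crawler ranges, and `BLOCKED_NETWORKS` and `is_blocked_ip` are hypothetical names. Real ranges would have to come from your own logs or from lists the crawler operators publish.

```python
import ipaddress

# Placeholder CIDR ranges (reserved documentation networks), NOT real crawler ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked_ip(addr: str) -> bool:
    """Return True if `addr` falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)


for addr in ("192.0.2.44", "203.0.113.9", "2001:db8::1"):
    print(addr, is_blocked_ip(addr))
```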

bhtl · posted on 2025-5-9 10:36:21

Could someone with the technical background weigh in on whether Plan 2 has any downsides for the site?

dqm5384 · posted on 2025-5-9 10:36:29

Just block the AI spiders.

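mahuman · posted on 2025-5-9 10:36:38

Treat it all as a CC attack: any IP that makes more than 40 requests in 60 seconds gets banned for 3600 seconds, and if it triggers the rule again after being unbanned, the ban time stacks automatically.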
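
The rule described here can be sketched as a sliding-window counter with an escalating ban. The thresholds (60 seconds, 40 requests, 3600 seconds) come from the reply. The `EscalatingBanList` class is a hypothetical name, and reading "stacks" as "the n-th ban lasts n × 3600 seconds" is an assumption, since the reply does not say how the ban grows.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # look at the last 60 seconds
MAX_REQUESTS = 40        # more than 40 requests in the window triggers a ban
BASE_BAN_SECONDS = 3600  # length of the first ban


class EscalatingBanList:
    def __init__(self) -> None:
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self.banned_until = {}           # ip -> time at which the ban ends
        self.strikes = defaultdict(int)  # ip -> number of bans so far

    def allow(self, ip: str, now: float) -> bool:
        """Return True if a request from `ip` at time `now` (seconds) is allowed."""
        if now < self.banned_until.get(ip, 0.0):
            return False
        window = self.hits[ip]
        window.append(now)
        # Drop timestamps that have fallen out of the 60-second window.
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_REQUESTS:
            self.strikes[ip] += 1
            # Assumption: the n-th ban lasts n * 3600 seconds.
            self.banned_until[ip] = now + BASE_BAN_SECONDS * self.strikes[ip]
            window.clear()
            return False
        return True


bans = EscalatingBanList()
ip = "192.0.2.1"  # documentation address, for illustration
first_run = [bans.allow(ip, float(t)) for t in range(41)]  # one request per second
print(sum(first_run), "allowed before the ban; banned until", bans.banned_until[ip])
```

In production the same idea is usually delegated to a firewall plugin or a tool that watches the access log, rather than implemented in application code.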

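sinalook · posted on 2025-5-9 10:36:46

Turn the site into a private, members-only site, something like Zhishi Xingqiu (知识星球), where only members can view the content. That takes care of the AI scraping problem.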

hijacker · posted on 2025-5-9 10:36:53

You can refer to https://phpy.cn/36.html

limao100 · posted on 2025-5-9 10:37:01

Even Wikipedia can't hold up against it anymore.

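湖光倒影 · posted on 2025-5-9 10:37:10

Thumbs up for sharing something genuinely useful! I just ban the IPs outright. A lot of these AI crawlers don't play fair and imitate normal user traffic, which is a really dirty trick.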
