i’m a curious soul.
i was curious what bots were scraping my sites, so i figured i’d do a quick survey.
i’m also super lazy, so i used a simple bash script to get the list of whoever has been pulling my robots.txt file, because no one else would.
so a quick
zgrep robots.txt access.log* | cut -d\" -f 6 | sort |uniq > agents.lst
on soc.jrconlin.com got me:
-
AwarioSmartBot/1.0 (+https://awario.com/bots.html; bots@awario.com)
FediCrawl/1.0
Googlebot/2.1 (+http://www.google.com/bot.html)
IonCrawl (https://www.ionos.de/terms-gtc/faq-crawler-en/)
Mastodon server indexer
Minoru's Fediverse Crawler (+https://nodes.fediverse.party)
Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html)
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:98.0) Gecko/20100101 Firefox/98.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE
Mozilla/5.0 (Windows NT 9_1; Win64; x64) AppleWebKit/547.47 (KHTML, like Gecko) Chrome/61.0.1793 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/543.47 (KHTML, like Gecko) Chrome/54.0.2644 Safari/537.36
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Linespider/1.1; +https://lin.ee/4dwXkTH)
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
Mozilla/5.0 (compatible; Nmap Scripting Engine; http://nmap.org/book/nse.html)
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SeznamBot/4.0; +http://napoveda.seznam.cz/seznambot-intro/)
Mozilla/5.0 (compatible; WellKnownBot/0.1; +https://well-known.dev/about/#bot)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; Yeti/1.1; +https://naver.me/spd)
Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36
Poduptime/Development/Testing
Poduptime/Production from https://fediverse.observer
Scrapy/2.7.1 (+https://scrapy.org)
SerendeputyBot/0.8.6 (http://serendeputy.com/about/serendeputy-bot)
caveman-hunter/0.0.0 (+https://fedi.buzz/)
curl/7.54.0
ws-bot-v1
Which is a lot.
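If the bare list isn't enough, the same kind of pipeline can also say how often each agent shows up. This is only a sketch: it assumes the logs are in the usual combined format (user agent in the sixth double-quote-delimited field, same as the one-liner above), and `count_agents` is a made-up helper name, not anything from the original script.

```shell
# count_agents (hypothetical helper): read access-log lines on stdin and
# print each user agent that requested robots.txt with a hit count,
# busiest first.
# Assumes combined log format, where splitting on double quotes puts the
# user agent in field 6.
count_agents() {
    grep -F 'robots.txt' \
        | cut -d'"' -f 6 \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Something like `zcat -f access.log* | count_agents` should feed it both plain and rotated gzipped logs, assuming GNU gzip, where `-f` passes uncompressed files through untouched.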
i’m also kinda curious about how many bots pretend really hard not to be a bot. (Looking at you, Applebot.) i know Google has (at least) two different flavors of crawlers (one fast, the other slow), so no huge surprise there.
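One rough way to poke at the "pretending not to be a bot" question is to split the agent list by whether the string admits to anything. The pattern below is purely a guess at what an honest bot looks like (a bot-ish word or a contact URL), and `declared_bots` / `quiet_agents` are hypothetical helper names. Note that Applebot would still land in the "declared" pile, since it does name itself at the very end of an otherwise ordinary Safari string.

```shell
# Assumption: an agent "admits" to being a bot if it contains a bot-ish
# word or a URL. This is a heuristic, not an authoritative bot list.
BOT_PATTERN='bot|crawl|spider|https?://'

# Agents in file $1 that match the pattern (case-insensitive).
declared_bots() {
    grep -Ei "$BOT_PATTERN" "$1"
}

# Agents in file $1 that give no hint at all, e.g. plain browser strings.
quiet_agents() {
    grep -Eiv "$BOT_PATTERN" "$1"
}
```

Running `quiet_agents agents.lst` would then list the browser-looking agents that fetched robots.txt, which is not something a person's browser normally does.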
Now, compare this with my close-to-20-year-old blog:
Buck/2.3.2; (+https://app.hypefactors.com/media-monitoring/about.html)
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Yeti/1.1; +https://naver.me/spd) Chrome/113.0.0.0 Safari/537.36
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
Mozilla/5.0 (compatible; MojeekBot/0.11; +https://www.mojeek.com/bot.html)
Mozilla/5.0 (compatible; ScooperBot/3.0; +http://www.carma.com)
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SeznamBot/4.0; +http://napoveda.seznam.cz/seznambot-intro/)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36
Scrapy/2.6.3 (+https://scrapy.org)
Twitterbot/1.0
That’s a whole lot less. As in 58% less.
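The 58% looks like plain line-count arithmetic: by a rough count the first list has 48 entries (treating the lone `-`, which is what gets logged when no user agent is sent, as one of them) and the second has 20, and (48 - 20) / 48 is about 58%. A small sketch of that calculation, where `percent_fewer` is a hypothetical helper name:

```shell
# percent_fewer BIG SMALL (hypothetical helper): how many percent fewer
# lines SMALL has than BIG, rounded to the nearest whole number.
percent_fewer() {
    # $(( ... )) also strips the padding some wc implementations add.
    big=$(( $(wc -l < "$1") ))
    small=$(( $(wc -l < "$2") ))
    # Refuse to divide by zero on an empty BIG file.
    [ "$big" -gt 0 ] || return 1
    # Adding big/2 before dividing rounds instead of truncating.
    echo $(( ((big - small) * 100 + big / 2) / big ))
}
```

With each site's agent list saved to its own file (say `soc.lst` and `blog.lst`, names picked only for illustration), `percent_fewer soc.lst blog.lst` gives the figure directly.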
Also kinda interesting to see the different bots that look for things. Clearly, the Federation attracts more bots.
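To see which bots only bother with one site, the two lists can be diffed line by line. A minimal sketch, assuming each site's agents were saved to separate files; `agents_only_in` is a made-up name, and the match is on the exact full user-agent string, so `Scrapy/2.7.1` and `Scrapy/2.6.3` count as different agents.

```shell
# agents_only_in A B (hypothetical helper): print the lines of file A
# that never appear in file B.
#   -F  treat each line of B as a fixed string, not a regex
#   -x  only count whole-line matches
#   -v  invert, keeping the lines of A with no match in B
agents_only_in() {
    grep -Fxv -f "$2" "$1"
}
```

For example, `agents_only_in soc.lst blog.lst` (file names are placeholders) would surface the Fediverse-specific crawlers like FediCrawl that never touch the blog.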
It’s also kinda hilarious to me that some of my domains (like EvilOnAStick.com) get FAR fewer crawlers. Apparently, these sites are part of the Dark Web.