You can use security challenges as a mechanism to identify false positives.
Sure bots can get tons of proxies for cheap, doesn’t mean you can’t block them similar to how SSH Honeypots or Spamhaus SBL work albeit temporarily.
Many bots cycle through short DHCP leases on LTE wifi devices. One would have to accept blocking all cell phones which I have done for my personal hobby crap but most businesses will not do this. Another big swath of bots come from Amazon EC2 and GoogleCloud which I will also happily block on my hobby crap but most businesses will not.
Some bots are easier to block as they do not use real web clients and are missing some TCP/IP headers making them ultra easy to block. Some also do not spoof user-agent and are easy to block. Some will attempt to access URL's not visible to real humans thus blocking themselves. Many bots can not do HTTP/2.0 so they are also trivial to block. Pretty much anything not using headless Chrome is easy to block.
[1]: in quotes, because I dislike the term, because it’s immaterial whether or not an ugly block of concrete out in the sticks is housing LLM hardware - or good ol’ fashioned colo racks.
We need a Crawler blacklist that can in realtime stream list deltas to centralized list and local dbs can pull changes.
Verified domains can push suspected bot ips, where this engine would run heuristics to see if there is a patters across data sources and issue a temporary block with exponential TTL.
There are many problems to solve here, but as any OSS it will evolve over time if there is enough interest in it.
Costs of running this system will be huge though and corp sponsors may not work but individual sponsors may be incentivized as it’s helps them reduce bandwidth, compute costs related to bot traffic.