ethin · 2026-06-01T21:25:32 1780349132

I... Think you completely missed the point, which is that each and every method you enumerated is a brute-force tactic.

emtel · 2026-06-02T00:17:00 1780359420

How is drugging an animal's food a "brute force tactic"? What would qualify as a non-brute-force tactic in your mind?

fragmede · 2026-06-02T00:47:25 1780361245

Learning to speak cat and then reasoning with the cat, in cat-ese, to get into the carrier of its own volition.

famouswaffles · 2026-06-02T13:56:39 1780408599

There is no cat-ese. Cats do not have language.

jjulius · 2026-06-02T00:39:03 1780360743

Well now we're just ignoring the point for the sake of a semantics debate.

famouswaffles · 2026-06-02T14:09:43 1780409383

It's just a poor point. Cats are a bad analogy anyhow because there is no cat language. You can't actually speak to them or have a drawn out discussion. But you can with humans and many have inspired thousands or even millions to action doing just that, no super-intelligence required.

jjulius · 2026-06-02T15:43:05 1780414985

Still missing the forest for the trees.

ethin · 2026-06-01T21:01:44 1780347704

I really have wanted to do this but Fossil lacks MFA, OIDC, and of course CI/CD. Maybe there's a way to get all three in it but Idk. I know for OIDC you could in theory just use a reverse proxy to do it but then you have to get Fossil to respect it and not just ask you to login again.

ethin · 2026-06-01T17:46:56 1780336016

Yep... And just think: this is what AI boosters want us to do.

ethin · 2026-06-01T17:07:48 1780333668

Agreed, that's a huge turn off for me, and I thought this would genuinely be fascinating. I'm not a physics expert but I love reading about interesting things like this, but I can't stand this surface-level "well I in theory could be an expert on this topic but nobody knows because the machine removed all of the nuance and now it's shallow AI writing" style of writing.

ethin · 2026-06-01T00:38:12 1780274292

I primarily use Incus for all container stuff, not Docker. It's problematic if I want to e.g. use a docker-compose file, but I think it protects against these things because Incus allows me to create a VM and not a container if I really need that level of isolation.

ethin · 2026-06-01T18:41:07 1780339267

What's the downvote for? Does someone really dislike Incus that bad?

ethin · 2026-06-01T00:26:45 1780273605

If only there was a language which allowed one to express instructions for a computer to execute which was nearly unambiguous, precise, deterministic, and containerized such that the computer would do exactly what you told it to.

...

Oh wait.

Yes, the above was referring to programming languages. Which is what prompts are, essentially. It's just a different (and more verbose) way of instructing the computer on what to do. It also has a solution space of infinity and is ambiguous enough that there is no way to secure it because there are infinite combinations of saying anything imaginable. All prompt injections do is prove this point, over and over and over again, and "prompting" an LLM is just reverse-engineering programming languages in the worst possible way. I suspect that we will eventually have no other choice but to revert to using programming languages because they are the only way to get the kind of protections that people are trying to come up with with all these containerization and virtualization systems (which inevitably fail).

onion2k · 2026-06-01T05:40:48 1780292448

You make a fair and valid point about prompts, but you're ignoring the fact that writing code that's truly secure is also virtually impossible. The stack of layers that an attacker can target ranges from your own code, to library code (Heartbleed), container escape (maskedPaths abuse), OS (Dark Sword, Ghost Tap), hardware (Spectre, Rowhammer), etc. Security is really hard. Fortunately exploiting these things is also hard.

The belief that something is more likely to be secure because it's code instead of a prompt is likely only avoiding one particular type of attack. That's a win, but you probably shouldn't think of it as meaning your code is actually secure.

ethin · 2026-05-31T19:37:52 1780256272

Wait really? I'm not really sure what to think and I posted before I saw this... I wonder why the limit is so low?

ethin · 2026-05-31T19:36:46 1780256206

Ouch, looks like the HN hug of death struck again. Gives me error 429.

ethin · 2026-05-31T18:55:28 1780253728

The problem is: what is the alternative? I'm not defending them or this practice by any measure, but we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system. I've hated CAPTCHAs ever since I first encountered them and I can't wait for them to just finally die a permanent death, but I also don't know how we solve the "how do you identify a human and a bot" problem in a way which doesn't require server admins to have extremely beefy servers or similar setups to handle the extra load. I'm not going to do the "there HAS to be a way" thing either because, for all I know, this could just be one of those impossible-to-solve problems.

jwr · 2026-05-31T19:08:10 1780254490

> we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system

No, we don't know. I honestly do not understand the problem. I run websites, both static and non-static. Granted, my sites aren't exactly the most popular internet go-to destinations, but I should be seeing this DDoS too, right?

I do see lots of requests. Nothing that any modern system can't handle. Computers are stupid fast these days. Unless you are doing something unreasonable, it's really hard to even notice this "extra load".

I understand there are sites for whom this causes problems, but I think these are rare and could be optimized not to do unreasonable things.

I think too many people are annoyed by AI companies (arguably understandable position), look at their logs and speak of "hammering", "DDoS" and "extra load", while in reality it doesn't matter much.

acdha · 2026-05-31T21:10:33 1780261833

We do know, just ask anyone who runs a more popular site or does anything where abuse can be monetized (shopping, reviews, etc.). Avoiding that due to obscurity isn’t an answer because it’s saying you’re safe until something, possibly outside of your control, causes the bots to descend and give you an extra 500M requests with no chance of revenue.

I’m with OP: I don’t like this but the alternatives all look like the death of the open web.

handoflixue · 2026-05-31T21:37:47 1780263467

> just ask anyone who runs a more popular site

The person you're responding to already said they ran a modestly sized site. What actual scale opens one up to abuse? If only the top 1% of sites need it, then it seems silly to say "everyone" needs it.

ceejayoz · 2026-05-31T22:34:01 1780266841

It’s not just scale. Do you accept user generated content? If so, more of a target.

wizzwizz4 · 2026-05-31T23:37:27 1780270647

Stack Overflow was outside of the Cloudflare network for years, and anti-abuse was maybe 3 or 4 full-time jobs – much of which still needs to be done, because Cloudflare's anti-bot protection hasn't actually stopped it. Most UGC sites are not as big as Stack Overflow was at its peak.

ceejayoz · 2026-06-01T11:38:29 1780313909

Most UGC sites also don't have a horde of volunteer mods voting to close/delete things.

wizzwizz4 · 2026-06-01T12:42:03 1780317723

I'm referring specifically to the activities of Charcoal (https://charcoal-se.org/) and their Stack Exchange staff counterparts, taken together. This is about large-scale platform abuse, of the sort that Cloudflare is alleged to prevent (but doesn't, really), not the more mundane (and laborious) task of manual quality control.

black_puppydog · 2026-06-01T01:51:07 1780278667

errr... so anything related to UGC now has a lower bound of 3-4 FTE? Sure, I'll hire a team of content moderators next time I think about putting a comment form under my blog...

Dylan16807 · 2026-06-01T02:45:11 1780281911

Please read their last sentence again and think about how much it understates the difference between stack overflow in its prime and a normal website. Also the "much of which still needs to be done".

GoblinSlayer · 2026-06-01T08:35:54 1780302954

Yes? Cloudflare doesn't replace moderators. At all. It only allegedly filters bot generated content, it doesn't filter user generated content and doesn't even intend to.

daishi55 · 2026-05-31T22:27:49 1780266469

So everyone is paying cloudflare… why?

tardedmeme · 2026-06-01T10:21:36 1780309296

Because paying with MITM is far less visible than paying with money

fragmede · 2026-05-31T22:34:32 1780266872

Most likely not. Their free tier is fairly generous.

LgWoodenBadger · 2026-06-01T16:48:20 1780332500

Because charging for bandwidth/traffic is still a thing, unfortunately

matt_heimer · 2026-05-31T20:35:50 1780259750

It might depend on the tech stack. I run a small niche website but it has PHP and a database (MediaWiki/phpBB) and without Cloudflare I'd estimate I'd need to spend several hundred dollars a month to handle the traffic. Traffic used to be tens of thousands of requests a day. AI has increased that to between 400k and 3M requests per day but it's not a smooth distribution. This is with bot fight mode on, which greatly reduces traffic.

I adopted Cloudflare because it was getting DDoSed by the AI crawlers. I'm pretty sure all of them are vibe coding their crawlers and don't bother adding rate limiting as a requirement.

jwr · 2026-06-01T14:23:09 1780323789

That was my point. I was trying to be gentle by mentioning "unreasonable" things, but seriously — how did we get to the point where less than 6 requests per second (that's 500k requests per day) is considered a DDoS?

I've spent some effort on optimizing my sites, but most of the effort was focused on avoiding unreasonable (stupid) work. Do I need a session for every request? No, I don't! Do I need a database fetch for every access to my homepage? No, I don't! Is it a problem to actually load all of my static content in all supported languages (24) into memory and serve it from memory? No, it isn't!

I use Clojure behind nginx on the server for my sites. Oh, and I also pre-compress all static assets to Brotli, so anything that handles brotli gets a static file served directly from nginx. I also use immutable assets with unlimited caching semantics.

Really — the problem is that we've grown lax and our software has become bloated, slow, and with unreasonable code paths. If every page fetch does 12 database accesses and runs through a slow interpreter, that is surely going to be a problem.

matt_heimer · 2026-06-02T17:26:27 1780421187

That's the traffic after rate limiting controls and bot fight mode. It's 3-4 million requests per day without bot fight mode and just rate limits. And as I said it's not a smooth distribution. Plus the requests are almost never for pages in cache. It's always stuff like loading all the message threads from the year 2000 or loading up the details of every page edit ever made to a wiki page.

If it was more static content it'd be easier, it's really the db being a bottleneck in a dynamic site.

Yes, the software could be better optimized but then I'd have to own the development of it. There is no reason a niche website should be getting millions of requests per day.

canyp · 2026-05-31T20:53:12 1780260792

I second this. My website exposes a cgit and 99% of the traffic now is AI scraping the sources, but the load is nowhere near DoS territory. And this is running on the cheapest VPS I could find.

Not saying I'm not annoyed by the scraping; I am looking to block them, but I'm also not going to put the site behind the gatekeeper. If anything, Cloudflare must love AI scraping now for the same reason AV companies love malware.

Now, if you are running a PHP stack...yeah, maybe that's the problem right there.

lxgr · 2026-05-31T22:14:14 1780265654

Is there actually any plausible theory why "AI" would repeatedly scrape the same sites? Are there that many competing, completely independent AI labs? Is it cheaper to repeatedly scrape than to buffer the scraped data locally? (I find it very hard to imagine that it's easier to deal with changing/disappearing content than it is to stand up such a cache.)

jack_pp · 2026-06-01T00:38:28 1780274308

If you ask an agent to check sources / function definitions of open source packages it will wget / curl it

GoblinSlayer · 2026-06-01T09:04:36 1780304676

It's an AI generated scraper that scrapes nonstop.

ern_ave · 2026-06-01T15:18:55 1780327135

> 99% of the traffic now is AI scraping the sources

I wonder if we should stop fighting this and instead create an API specifically for this purpose? Or, a central repository that you could send your data to and say to anyone wanting to scrape, "save yourself some time and just get my data from this other place"

canyp · 2026-06-02T00:18:02 1780359482

The thing though is that they are extremely idiotic. They are constantly, recurringly, scanning the same files, I suppose out of FOMO that a line might have changed. I don't know what a special API solves, especially because HTTP already has etags to save you from re-downloading the whole damn file over again. But these bots don't care. The extent to which they don't care is such that, after I temporarily took cgit down for kicks, they'd get 404s and still repeatedly ask for the same files for days on end.
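For readers who haven't used ETags, here is a minimal Python sketch of the conditional-request idea mentioned above. It is not how cgit or any particular server implements it: `make_etag` and `respond` are hypothetical helpers, and real `If-None-Match` handling also has to cope with lists of tags, `*`, and weak validators, which this ignores.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    # Derive a validator from the content. The surrounding double quotes
    # are part of the ETag header syntax.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(
    body: bytes, if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    # Hypothetical handler: returns (status, headers, payload).
    etag = make_etag(body)
    if if_none_match is not None and if_none_match.strip() == etag:
        # The client already holds this exact version: send no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


page = b"int main(void) { return 0; }\n"

# First fetch: full body plus an ETag the client can remember.
status, headers, payload = respond(page)
print(status)  # 200

# Repeat fetch that sends the remembered ETag back in If-None-Match.
status_again, _, payload_again = respond(page, headers["ETag"])
print(status_again, len(payload_again))  # 304 0
```

A crawler that stored the ETag and sent it back would pay only for headers on unchanged files, which is the saving the bots described here are skipping.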

account42 · 2026-06-01T10:05:39 1780308339

The PHP stack isn't even the problem, it's having unauthenticated requests getting past the cache in the first place, something that most sites should be able to prevent.

JohnTHaller · 2026-05-31T23:00:55 1780268455

If you're in any way semi-popular and a decent size, you're gonna get hammered. PortableApps.com was partially offline for weeks due to China-based AI scrapers. You block the useragent, they start hitting you with another one from the same IP in the same way. You block the IP, they switch to another. You block the subnet, they use another. At one point it was nearly a thousand different IPs from around China hammering away. For all intents and purposes, a DDoS. This wasn't a little "extra load", this was load that was thousands of times beyond what our legitimate userbase was using.

And if you're thinking about blocking all of China, while this particular AI bot didn't use them, a bunch of other ones I've encountered use VPNs and hacked clients worldwide.

hombre_fatal · 2026-06-01T01:32:12 1780277532

Consider yourself lucky. But don't let yourself fall into the trap of thinking it's a nonissue for everyone else until it happens to you.

People shouldn't have to be experts or provision a larger server to run a UGC service that can withstand the sort of 30x more traffic I'm seeing from AI bots. Or rather, you didn't render the argument for why they should have to do that if they can just use CloudFlare's free tier.

Either way, it's easy to have all the answers when you've never had the problem.

ethin · 2026-05-31T19:34:35 1780256075

Has anyone pointed an AI scraper at your server at all? Unless your website appears in search engine listings I don't think the AI scrapers will slam it. My server has never been hit by them but my server is also practically unknown. All of this said, I'm not going to claim that server loads can handle it because many sysadmins have claimed otherwise, and I would like to think that their claims are reliable.

redox99 · 2026-05-31T19:42:32 1780256552

As soon as you get your TLS certificate you get bombarded with scraping. You don't need someone to "point a scraper at you".

What matters most is usually how much there is to scrape. If you have like 5 pages that's nothing. For forum like websites where each thread, each user profile, etc. gets scraped that's when traffic increases. I just let them have at it with no issues though, computers are fast.

ethin · 2026-05-31T22:10:14 1780265414

That's really weird. My experience is quite different: I have several subdomains and all of them have TLS certs and I haven't (yet) seen this (thankfully). Either that, or my server is masking it. The weird thing is that my server is an OVH dedicated box that doesn't exactly have top-tier specs, so I have no idea what's going on there. Very weird indeed.

redox99 · 2026-05-31T22:48:18 1780267698

Probably you don't have much to scrape?

ethin · 2026-06-01T00:33:37 1780274017

I mean... It may be that most of the things I run aren't really scrape-able. I run Matrix (which requires authentication), an XWiki instance, Zulip, Terraria, Forgejo, Nextcloud, a Mastodon server... Most of those require auth behind my Kanidm instance to actually do anything. Well and most of them have APIs that are much better than "scrape the universe".

GoblinSlayer · 2026-06-01T09:15:04 1780305304

If you run the site on a custom port, scrapers won't find it?

userbinator · 2026-05-31T19:42:37 1780256557

Also, how do we even know they're really "AI scrapers", or just a deliberate DDoS to push sites into using CF or other "anti-bot" providers?

danielheath · 2026-05-31T22:09:02 1780265342

They showed up when the AI money did. The evidence is circumstantial, but… some of them are remarkably well engineered (from a “how difficult is it to identify this traffic” perspective), in a way that never existed before (I have been running a quite sizeable site for 8 years, over 200k registered users, and you don’t need to register to use 99% of it).

whstl · 2026-06-01T01:49:03 1780278543

I run a quite large website and there are a few patterns.

The usage is extremely quick, and follows easy-to-spot patterns. We noticed a spike in bounce rate.

They never come from Google, and the badly programmed ones just crawl several pages at a time, faster than a user could do.

Then there's the crazy spikes in visits from specific countries, pretty much scraping the entire content. Often from pools of IPs. In some cases we had 30% unexplained (meaning: it wasn't viral or a marketing campaign) random sustained increases in traffic.

There's also the fact they don't interact with the complicated widgets, so zero XHR requests other than analytics pings.

They also don't cause spikes in Google Analytics, so I assume it's blocked, but they show up in logs and in the internal analytics.

It's not enough to DDOS the website at all, but it's a lot of noise in statistics that we gotta learn to filter.

JimDabell · 2026-06-01T05:37:59 1780292279

> They never come from Google, and the badly programmed ones just crawl several pages at a time, faster than a user could do.

I’ve triggered this kind of “bot protection” right here on Hacker News many times. I did that by having a bunch of Hacker News pages open and then closing and reopening my browser. I’ve also triggered it by opening a bunch of links in the background too quickly. I’ve also triggered it by reading the article, then clicking back and upvoting/favouriting too quickly. I’m also located in Singapore, which people have started to advocate for blocking here recently.

A single non-bot legitimate user can easily trigger these kinds of heuristics just by using the site in a way you don’t expect. This can affect some users disproportionately more than others, e.g. disabled people who need to use assistive technology.

whstl · 2026-06-01T08:25:27 1780302327

Oh I also do this all the time.

What I mean by "too fast" is opening 50 pages in the span of two or three milliseconds.

Either way, I'm not blocking. The CDN is handling the traffic alright.

danielheath · 2026-06-01T08:54:01 1780304041

I hate that sort of thing - when I rolled my own proof-of-work bot protection (providers wanted $$$$), I set it up so that

A) you'd have to open >200 tabs, and B) if any tab solves the proof-of-work, any that are still waiting to do so reload in the background.
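The comment above doesn't say how the proof-of-work itself was built, so the following is only a generic hashcash-style sketch in Python, not danielheath's scheme. The function names, the SHA-256 choice, and the 12-bit difficulty are all assumptions for illustration: the client searches for a nonce, and the server verifies it with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    # Count how many zero bits the digest starts with.
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    # Server side: one hash, cheap to check.
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    # Client side: brute-force search, about 2**difficulty hashes on average.
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


challenge = b"hypothetical-per-visitor-challenge"
nonce = solve(challenge, difficulty=12)
print(verify(challenge, nonce, difficulty=12))  # True
```

The asymmetry is the point: solving costs thousands of hashes, checking costs one. How a solved token is then shared across waiting tabs, as described above, is a separate design question this sketch doesn't cover.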

userbinator · 2026-06-01T00:04:59 1780272299

Yes, circumstantial is exactly the point; it's easy to use AI as a scapegoat because it's something popular to hate on.

danielheath · 2026-06-01T01:24:13 1780277053

It's circumstantial evidence, but Occam's Razor also applies.

It's not a hostile DOS in the traditional sense (I've mitigated a few of those) - no "pay us to make it stop", no pattern to the requests other than "fetch every unique URL a few times".

It wasn't happening until financial incentives to gather large datasets for AI training appeared.

Bad actors (using residential proxies & claiming to be a real browser) mostly showed up after folk started blocking ones that identified themselves as AI scrapers.

It's obvious to blame AI training because there's a shortage of better explanations. Who else would be paying for these (expensive) residential botnets, only to use them to (eg) web-scrape wikipedia (which offers free downloads of its content in a structured format)?

The simplest explanation of the technical behavior is "a bot coded to follow every link it sees & save the results", and the simplest explanation of the motive to run such a bot is "to train a large language model".

userbinator · 2026-06-01T02:03:16 1780279396

> no "pay us to make it stop"

"use Cloudflare to make it stop"

danielheath · 2026-06-01T03:31:37 1780284697

Or fastly, or akamai, or bunny, or any number of other providers.

Cloudflare are merely the cheapest of the bunch.

userbinator · 2026-06-01T04:40:34 1780288834

Exactly. They (and most of all, Big G) stand to profit greatly from this browser discrimination. What better than to make more sites use them by launching DDoS attacks in the name of "AI scraping".

dr_um · 2026-05-31T20:13:45 1780258425

A small, non-static e-commerce site focused on a single EU country, with proper robots.txt instructions that worked perfectly well in the search-bots-only "era" and with rate limiting on its nginx/php-fpm setup, is kinda struggling without CF to handle 15000 requests per 15 minutes, coming from Chrome "users" on IPv6. Best so far was an avg. server load in htop = 40 on an 8-core server x_x

redox99 · 2026-05-31T22:56:00 1780268160

That's 16.6rps. A single guy holding the F5 key on chrome can generate that much traffic and take down your website. That kind of performance was never acceptable.
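The per-day and per-15-minute figures quoted in this thread are easy to sanity-check. A small Python sketch, where `requests_per_second` is just an illustrative helper and the inputs are the numbers commenters gave (15000 requests per 15 minutes, 500k and 3M requests per day):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_second(requests: float, window_seconds: float) -> float:
    # Average rate only; it says nothing about bursts within the window.
    return requests / window_seconds


print(round(requests_per_second(15_000, 15 * 60), 1))  # 16.7
print(round(requests_per_second(500_000, SECONDS_PER_DAY), 1))  # 5.8
print(round(requests_per_second(3_000_000, SECONDS_PER_DAY), 1))  # 34.7
```

These are averages, so a site whose traffic is "not a smooth distribution" can see peaks well above them.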

bschwindHN · 2026-06-01T02:00:49 1780279249

People will always reframe their request numbers to avoid stating their pitiful requests per second numbers, it's hilarious. "This thing is handling hundreds of thousands of requests per day!" Like cool, you're barely making it double digit requests per second.

PunchyHamster · 2026-05-31T21:33:00 1780263180

> handle 15000 requests per 15 minutes,

that's just ~17 req/sec

That's "cheap VPS running wordpress" level of traffic

BenjiWiebe · 2026-06-01T11:41:50 1780314110

Maybe a plain WordPress install. Run something like WooCommerce and install a bunch of plugins to get the functionality that WordPress and WooCommerce should have built-in, and suddenly a cheap VPS can only handle 2 or 3 requests per second.

It's phenomenal how inefficient the WordPress/WooCommerce stack is.

Though the main issue I'm seeing is credit card testing, not scraping.

And I'm ideologically opposed to using a CDN (because it shouldn't be needed for such a small site!) so it's somewhat a self-inflicted problem...

PunchyHamster · 2026-06-01T12:16:22 1780316182

"Security" plugins are also a HUGE problem here: most of them turn "a few cached DB SELECTs" (or a static file read if you use a caching plugin) into a bunch of inserts, just to log/analyze the "offender" IP and maybe block it, in many cases making "blocking the offender" more costly than serving the page without the security plugin would be.

codedokode · 2026-06-01T03:07:34 1780283254

You can calculate traffic stats for a day by IPs/subnets and probably bots will stand out. If they are using IPv6 you can figure out the ASN and block it completely.
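A rough Python sketch of that kind of daily aggregation, assuming you have already pulled the client IPs out of your access log. The /24 and /48 grouping sizes and the helper names are assumptions, and mapping a network to its ASN needs an external dataset that is not shown here.

```python
import ipaddress
from collections import Counter
from typing import Iterable, List, Tuple


def network_key(ip: str) -> str:
    # Group IPv4 by /24 and IPv6 by /48 (arbitrary choices for illustration).
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def top_networks(ips: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    # Count requests per network and return the heaviest hitters.
    return Counter(network_key(ip) for ip in ips).most_common(n)


# Made-up sample using documentation address ranges.
sample = [
    "203.0.113.5", "203.0.113.9", "203.0.113.200",
    "2001:db8:1::1", "2001:db8:1:ffff::2",
    "198.51.100.1",
]
for network, hits in top_networks(sample):
    print(network, hits)
```

As other commenters point out, this only helps when the traffic clusters; a scraper that sends one request per residential IP will not stand out in a per-network count.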

canyp · 2026-05-31T21:43:33 1780263813

Block out IPv6 and see if that helps.

lxgr · 2026-05-31T22:16:08 1780265768

Why not block all odd v4 addresses while you're at it? I heard that that can reduce scraping volume by 50%!

ssl-3 · 2026-06-01T00:35:40 1780274140

That's harder to set up, and also unfair to people who have an odd IP address.

It's easier and better to just block 0.0.0.0/1 half of the time, and 128.0.0.0/1 for the other half of the time. Switch every day at noon.

Bot traffic will be cut by 50%, and humans are all treated equally! It's a total win!

tardedmeme · 2026-06-01T10:24:54 1780309494

And blocking ipv6 addresses isn't unfair to people who have an ipv6 address?

ssl-3 · 2026-06-01T18:49:50 1780339790

Yeah, I suppose you're right.

Just block it all.

ipaddr · 2026-06-01T00:34:36 1780274076

Blocking Singapore reduces the AI load 90%.

redox99 · 2026-05-31T19:39:18 1780256358

You get downvoted for these opinions but I agree. Most people that complain that their servers get hammered by AI bots are those that run very unoptimized servers that can only handle like 100 rps. I've never had any issues with any of my moderately optimized websites. A $10 VPS can handle sooo much traffic.

CodeBytes · 2026-05-31T21:28:09 1780262889

I think people get annoyed when it's suggested they spend time optimising or even re-writing their websites to handle high traffic loads just to cater to AI bots ripping their content.

It's also not always easy to do. I run a small wiki which is fairly optimised, nearly every page manages at least ~3k rps on a small VPS. The only exception is the diff page which is ~150 rps. Optimising that while still giving good output isn't that easy, but the wiki doesn't have many users so that would be fine if it wasn't for the AI bots.

The AI bots ignore robots.txt and were initially hitting the site with ~1k rps crawling every combination. Even that would be manageable as there's currently ~150,000 combinations, except they kept re-crawling the whole lot each day. The server could manage it but it was a massive waste of resources.

They were using residential IPs and only sending 1 request from each IP making it impossible to block. In the end I gave up and put a Cloudflare challenge in front of it. I don't want to use Cloudflare but the alternative is forcing users to login to view diffs or remove them entirely.

redox99 · 2026-05-31T22:44:08 1780267448

What I do is have more strict rate limits for non logged in users. You tell them to log in if they hit the rate limit. For non logged in users, you have a rate limit not just for IP, but also for /24 and /16. Forget about IPv6, IPv4 scarcity is a feature not a bug.
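A minimal Python sketch of that layered idea, for illustration only: it is not redox99's implementation. The class name, the fixed-window counting, and the limits are all made up, it handles IPv4 only, and it never expires old counters, which a real deployment would need.

```python
import ipaddress
import time
from collections import defaultdict
from typing import Dict, Optional

# Hypothetical per-minute limits, keyed by prefix length: single IP, /24, /16.
DEFAULT_LIMITS: Dict[int, int] = {32: 60, 24: 300, 16: 1500}


class PrefixRateLimiter:
    def __init__(self, limits: Optional[Dict[int, int]] = None,
                 window_seconds: int = 60) -> None:
        self.limits = dict(limits or DEFAULT_LIMITS)
        self.window_seconds = window_seconds
        self.counts: Dict[tuple, int] = defaultdict(int)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        bucket = int(now // self.window_seconds)  # fixed-window bucket
        keys = {
            prefix: (bucket, str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False)))
            for prefix in self.limits
        }
        # Reject if the IP itself or any enclosing network is over its limit.
        if any(self.counts[key] >= self.limits[prefix] for prefix, key in keys.items()):
            return False
        for key in keys.values():
            self.counts[key] += 1
        return True


limiter = PrefixRateLimiter({32: 2, 24: 3, 16: 100})
results = [
    limiter.allow("203.0.113.1", now=0),
    limiter.allow("203.0.113.1", now=0),
    limiter.allow("203.0.113.1", now=0),  # per-IP limit reached
    limiter.allow("203.0.113.2", now=0),
    limiter.allow("203.0.113.3", now=0),  # /24 limit reached
]
print(results)  # [True, True, False, True, False]
```

A rejected anonymous request would be the place to show the "please log in" prompt; logged-in users would be checked against a separate, looser limiter.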

CodeBytes · 2026-06-01T02:52:58 1780282378

The bot I had was using unique IPs for each request. Some were from cloud providers but most were just random residential ISPs. I couldn't see any obvious connections so rate limiting would've had to be a global rate limit.

Similar to the one SQLite had: https://www2.sqlite.org/forum/forumpost/7d3eb059f81ff694?t=h

Each IP only makes ~1 request though so easy to detect after the fact.

I guess they will run out of IPs at some point so maybe if I had logged each one forever and shown a challenge only to them, it would have fixed it eventually. Just depends how big their pool of IPs is.

redox99 · 2026-06-01T07:12:31 1780297951

You were getting 1k rps, and each request was from a unique IP? So after an hour you got hit by 3.6M different IPs? And all from uncorrelated /16s? That seems hard to believe. Not that I don't believe you, it's just hard for me to grasp that whoever was scraping you had such a large and distributed swarm.

tardedmeme · 2026-06-01T10:26:14 1780309574

This is called rotating residential proxy service. You can buy it off grey market sites that are probably getting it from botnet operators. It costs about $2-$5 per GB.

redox99 · 2026-06-01T17:25:16 1780334716

Interesting, that definitely seems to be it.

canyp · 2026-05-31T21:47:59 1780264079

Curious, but how do the bots figure out the combinations? Or do you have links to the diffs from other sites? I assume the diff takes two files in query parameters or something.

CodeBytes · 2026-06-01T02:26:45 1780280805

I'm not 100% sure but I think links. There's a bunch on the history and revision pages. Yeah, the diff URL has two revision IDs as parameters.

I did try removing some of the links without success. I guess once they have them they just keep checking.

account42 · 2026-06-01T10:23:00 1780309380

There really isn't a good reason for a wiki (or git host) to provide diffs between arbitrary revisions to unauthenticated users. Limit it to diffs compared to previous (which can be cached) and this problem goes away.

In any case, such labyrinths of expensive dynamically generated pages are no excuse for subjecting people requesting the start page to bot checks.

Velocifyer · 2026-06-01T12:21:17 1780316477

I see many MediaWiki wikis (like the Arch Linux wiki) using Anubis successfully. It can be configured to only act on certain paths.

Dylan16807 · 2026-06-01T02:53:59 1780282439

I managed to solve my scraper problems without optimizing much, but if I had to optimize I think the only option might be "don't use mediawiki" and that's an extremely obnoxious solution. Though maybe I could get there by throttling specific kinds of pages.

piker · 2026-05-31T22:10:48 1780265448

Same. Tritium and the blog have done stints on the front page here and high traffic subreddits and that plus bots has never been a problem. UX could be improved through a CDN but even that isn’t worth the trade-off for us at the moment.

RHSeeger · 2026-05-31T23:42:05 1780270925

> I understand there are sites for whom this causes problems, but I think these are rare and could be optimized not to do unreasonable things.

There are. They're not. They can't (without significant effort)

xg15 · 2026-05-31T19:45:57 1780256757

I don't think it's just privacy, it also increasingly turns the web itself into a walled garden. The end result is that websites can only ever be accessed by "approved" clients - the latest Chrome, Edge, Safari and if you're lucky Firefox - and nothing else.

robertlagrant · 2026-05-31T22:03:00 1780264980

> and if you're lucky Firefox

I haven't had any problems with Firefox so far. Why do you say this?

xg15 · 2026-05-31T22:08:34 1780265314

That was more a (gloomy) outlook into the future, given Chrome's market dominance and tendency for unilateral actions in web standards.

robertlagrant · 2026-05-31T22:11:55 1780265515

I haven't ever noticed Cloudflare having any issues on Firefox, so presumably that implies any unilateral actions in web standards have been worked around by CF to provide the service to Firefox as well.

amatecha · 2026-06-01T00:48:02 1780274882

I'm pretty frequently blocked by Cloudflare when I use Firefox on OpenBSD -- apparently it's too suspicious of a combination for their liking, or something. Even on Linux I've occasionally had issues. I've had to email site operators to ask them to change their configuration so I can actually be a customer of their business.

robertlagrant · 2026-06-02T13:25:33 1780406733

Oh dear. That is tricky. It must be a rare enough combination that it looks like automation.

account42 · 2026-06-01T10:28:03 1780309683

It's already a problem with Firefox + some essential web condom extensions.

adgjlsfhk1 · 2026-06-01T01:31:52 1780277512

I think there's some chance we get a "proof of purchase" system where there is some entity that takes a $10 payment to give out a unique identity token that you need to present to visit most sites. if you have a revocation process for ones used for bad actors, it seems like it would work pretty well.

tardedmeme · 2026-06-01T10:27:56 1780309676

That's called an IP address. You pay your ISP $50+ every month to get one. Has it worked so far?

BenjiWiebe · 2026-06-01T11:47:46 1780314466

If the bad guys also had to pay $50/month/IP it would probably work.

The bad guys don't pay that much. And sometimes the bad guys actually use the IPs of other people (botnets on residential IPs) and don't pay anything at all.

tardedmeme · 2026-06-01T16:10:39 1780330239

They pay something. You can get a few tens of cents per gigabyte for a voluntary proxy right now. I've never tried it long enough to get a minimum payout, so it could be a scam for all I know (or maybe the minimum payout is the scam).

What would stop you offering someone a few tens of cents per GB to borrow any other token barrier you put up?

patrakov · 2026-06-01T05:32:13 1780291933

Except if your country is under sanctions.

account42 · 2026-06-01T09:57:07 1780307827

> we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system

So delegalize it. Strip searching everyone to paper over the fact that the societal contract has been broken only delays that.

codedokode · 2026-06-01T03:05:51 1780283151

> AI bots which hammer servers

You can easily calculate which IPs/networks bots are using by looking at where most traffic comes from and who requests a lot of pages at non-human speed.
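A rough sketch of that kind of counting, assuming the access log has already been parsed into a list of `(ip, unix_timestamp)` pairs. The function names, the /24 grouping, and the 10-second window are illustrative assumptions, not anything from a specific tool, and the grouping as written only makes sense for IPv4.

```python
from collections import Counter
import ipaddress


def top_talkers(requests, prefix_len=24, top_n=3):
    """Count requests per IP and per network (IPv4 /24 by default).

    requests: list of (ip_string, unix_timestamp) pairs.
    """
    per_ip = Counter()
    per_net = Counter()
    for ip, _ts in requests:
        per_ip[ip] += 1
        # strict=False lets us build the enclosing network from a host address.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        per_net[str(net)] += 1
    return per_ip.most_common(top_n), per_net.most_common(top_n)


def max_rate(requests, ip, window=10):
    """Largest number of requests from one IP inside any `window`-second span."""
    times = sorted(ts for addr, ts in requests if addr == ip)
    best, start = 0, 0
    for end, t in enumerate(times):
        # Slide the left edge until the span is shorter than the window.
        while t - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

What counts as "non-human speed" is a judgment call; the threshold you compare `max_rate` against would have to come from your own traffic.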

AlexeyBelov · 2026-06-03T15:34:16 1780500856

Each IP address is either from a residential proxy network, or from AWS / GCP / DigitalOcean. And each IP requests at human speed. 1000 of them are an issue though.

codedokode · 2026-06-03T19:02:45 1780513365

If you aggregate over a day, it might become more obvious. Also, a datacenter network is a big red flag.
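Something like the following could do the per-day aggregation, again assuming requests arrive as `(ip, unix_timestamp)` pairs. `DATACENTER_NETS` is a hypothetical placeholder list using a documentation address range; real provider ranges would have to be loaded from wherever you trust. The sketch handles IPv4 only.

```python
from collections import defaultdict
from datetime import datetime, timezone
import ipaddress

# Hypothetical datacenter ranges (a documentation range, not real provider data).
DATACENTER_NETS = [ipaddress.ip_network("198.51.100.0/24")]


def daily_report(requests, prefix_len=24):
    """Total requests per (UTC day, network), busiest first.

    Each row is (day, network, request_count, is_datacenter).
    """
    totals = defaultdict(int)
    for ip, ts in requests:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        totals[(day, net)] += 1

    report = []
    for (day, net), count in totals.items():
        # Flag networks that fall inside any known datacenter range.
        is_dc = any(net.subnet_of(dc) for dc in DATACENTER_NETS)
        report.append((day, str(net), count, is_dc))
    return sorted(report, key=lambda row: row[2], reverse=True)
```

Many IPs that each look slow on their own can still add up to an obvious total once they are grouped by network and day.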

By the way, what's your opinion about running a cryptominer on requests from datacenter and bot IPs?

PunchyHamster · 2026-05-31T21:31:35 1780263095

We have a few dozen websites, from ones doing single-digit Mbit to a few Gbit.

Never needed it. Just put the worst offenders in a penalty bucket and that's usually enough.
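One possible reading of a "penalty bucket", sketched under assumptions: clients that exceed a request limit within a window get refused for a cooldown period. The class name, the limits, and the choice to reject outright (rather than throttle) are all illustrative; the comment above doesn't say how the actual setup works.

```python
class PenaltyBucket:
    """Fixed-window counter that moves heavy clients into a timed penalty."""

    def __init__(self, limit=100, window=60.0, penalty=600.0):
        self.limit = limit        # max requests allowed per window
        self.window = window      # window length in seconds
        self.penalty = penalty    # how long an offender stays penalized
        self.hits = {}            # client -> (window_start, count)
        self.penalized = {}       # client -> time when the penalty ends

    def allow(self, client, now):
        """Return True if the request at time `now` should be served."""
        until = self.penalized.get(client)
        if until is not None:
            if now < until:
                return False
            del self.penalized[client]  # penalty has expired

        start, count = self.hits.get(client, (now, 0))
        if now - start >= self.window:
            start, count = now, 0       # start a fresh window
        count += 1
        self.hits[client] = (start, count)

        if count > self.limit:
            self.penalized[client] = now + self.penalty
            return False
        return True
```

The clock is passed in as `now` so the behavior can be checked without waiting; in a real server it would come from something like `time.monotonic()`.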

hem777 · 2026-06-01T07:56:39 1780300599

The alternative is to not have that one choke point that can be hammered. Decentralize.

arrty88 · 2026-06-01T03:29:36 1780284576

I use CF and I don't enable these anti-bot measures. It's up to the webmaster.

Unit327 · 2026-06-01T06:58:37 1780297117

Anubis is one alternative, kinda sucks that we need to slow down the web for everyone a little bit though.

steelframe · 2026-05-31T20:29:19 1780259359

The most plausible near-term path is probably micropayments embedded invisibly in AI agents. Your agent, which has learned what you value and can make a reasonable decision to allow a micropayment for certain content, pays on your behalf without requiring a conscious decision each time, eliminating the mental transaction cost problem entirely. It's the mental transaction cost that arguably led to the failure of the micropayment model back in the early 2000s.

Although the cynical part of me says that this will result in malicious actors trying to trick agents into giving out a bunch of micropayments. There are countermeasures that can help detect and compensate for that, but perhaps the best we will be able to do is prompt the user with the agent's default recommendation.

ethin · 2026-05-31T05:43:47 1780206227

It may always happen but it would happen less if we updated patent laws to fine people who filed invalid patents or enforced some kind of similar punishment. If you file a patent, it's up to you to verify that your patent is actually valid, and the courts shouldn't have to do that legwork for you. It also doesn't help that the patent office/components of governments don't review patents as thoroughly as they used to. Same with trademarks.

zamadatix · 2026-05-31T06:38:33 1780209513

I generally don't like the current patent law, but it sounds a bit off to pay the government and wait for them to review your patent claim, and then get fined by the government when both of you were wrong about it. There are already processes to additionally fine a company bringing a truly frivolous patent lawsuit; it's just rare because usually it's not as cut and dried as we'd like it to be.

ethin · 2026-05-31T06:53:49 1780210429

I mean my idea isn't the only one in that solution space. My reasoning was to ensure that the government actually reviewed the patent and ensured it was valid instead of rubber stamping it. Or, even better, the filer of the patent application would do that. Although the best is probably to make software unpatentable anyway.

zamadatix · 2026-05-31T10:23:01 1780222981

Yeah, I wouldn't mind getting rid of software patents. Or, at least, making the patent reviews more rigorous before assignment.

shmerl · 2026-05-31T06:30:31 1780209031

Even better, software patents should not be allowed in the first place.

lofaszvanitt · 2026-05-31T06:54:10 1780210450

Why are patents a problem? Please explain. If you build something that has been patented, then, well, you pay the per-piece fee on it.

ZeroGravitas · 2026-05-31T07:23:36 1780212216

> In the 1980s, when IBM accused Sun of violating seven patents, Sun examined the patents and argued that IBM didn't have a case. The reply of IBM's lawyers was "maybe you don't infringe these seven patents. But we have 10,000 U.S. patents. Do you really want us to go back to Armonk [IBM headquarters in New York] and find seven patents you do infringe? Or do you want to make this easy and just pay us $20 million?" And Sun paid out.[4]

lofaszvanitt · 2026-06-01T09:54:55 1780307695

Big corpo shenanigans, from the 80s. Any recent quotes?

shmerl · 2026-06-01T15:41:50 1780328510

It's pretty much the modus operandi of any patent troll or, more generally, of any racketeer. Nothing new about it.

https://www.youtube.com/watch?v=Zz0_r4PDyCs&t=1m30s

lofaszvanitt · 2026-06-02T15:29:55 1780414195

That doesn't warrant a general ban on patents, just because some out of touch living in a fantasy reality kids have an erection saying this.

shmerl · 2026-06-02T15:38:08 1780414688

A ban on software patents is more than warranted; obviously only the trolls themselves don't like the idea.

But what is also needed is prosecuting these patent trolls for racketeering, like any mob racketeers. There were attempts to prosecute them this way in the past, but they should be renewed.