I fear for the unauthenticated web
117 points by SethMLarson 3 months ago | 115 comments
- cxr 3 months agoPerversely, this submission is essentially blogspam. The article linked in the second paragraph, to which this "1 minute" read adds almost nothing of value, is the main story:
<https://thelibre.news/foss-infrastructure-is-under-attack-by...>
394 comments. 645 points. Submitted 3 hours ago: <https://news.ycombinator.com/item?id=43422413>
- btown 3 months agoBut also ironically, it's almost heartwarming these days to see blogspam that's not machine-generated! A real live human cared enough about an article to write a brief (perhaps only barely substantial, but at least handwritten) take on it!
It's reminiscent, perhaps, of the feel and motivation for Tumblr reblogs - and Tumblr continues to be vibrant by virtue of this culture: https://www.tumblr.com/engineering/189455858864/how-reblogs-... (2019)
Now, is driving attention and reputation to their site (in the broadest senses) part of a blogspammer/reblogger's motivation? Absolutely!
But should we be concerned about rewarding their act of curation, as long as there is at least some level of genuine curation intent? A world where that answer is categorically "no" would be antithetical, I think, to the concept of the participatory web.
- dkkergoog 3 months ago"heartwarming ... To see blogspam" the internet was a mistake
- wongarsu 3 months agoThe internet was great; everything we did with it in the last 20 years was the mistake, culminating in a comment observing that blogspam can now be one of the positive notes in the hellscape we are building.
A very useful hellscape though, for all its flaws
- MisterTea 3 months agoI don't feel this is blogspam; it's more of a quick comment on the situation pointing to the actual article. I don't see anything wrong with writing a short post boosting or commenting on another article. There are no ads, so I don't see this as blogspam, which I associate with financial gain or clout.
- tempfile 3 months agoIt also linked to https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali..., which is another worthwhile read.
- Cheer2171 3 months agoAll the time I see links on HN front page to Twitter and Mastodon posts with just as little text to them. Why does it upset you when it is in the medium of blogs, but not micro blogs?
- SethMLarson 3 months agoHehe, just participating in POSSE :) Funnily enough the story you're linking to quotes me with pictures of a story I wrote (https://sethmlarson.dev/slop-security-reports) about LLM-generated reports to open source projects.
- hugs 3 months agoI might be naive, but I think it's time we seriously start implementing "HTTP status code 402: Payment Required" across the board.
"L402" is an interesting proposal: paying a fraction of a penny per request. https://github.com/l402-protocol/l402
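The 402 flow is easy to prototype. Below is a minimal sketch in Python with a purely illustrative token scheme; the actual L402 protocol, as I understand it, negotiates Lightning invoices and macaroons via the WWW-Authenticate header, which this does not attempt:

```python
import http.server
import threading
import urllib.request

PAID_TOKENS = {"demo-token"}  # stand-in for verified payment receipts

class PaywalledHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        auth = self.headers.get("Authorization", "")
        token = auth.removeprefix("Bearer ").strip()
        if token in PAID_TOKENS:
            body, status = b"premium content", 200
        else:
            body, status = b"payment required", 402  # HTTP 402 Payment Required
        self.send_response(status)
        if status == 402:
            # illustrative placeholder; L402 puts a real invoice here
            self.send_header("WWW-Authenticate", 'L402 invoice="example"')
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), PaywalledHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# a paying client gets through; one without a token would get a 402
req = urllib.request.Request(f"http://127.0.0.1:{port}/",
                             headers={"Authorization": "Bearer demo-token"})
print(urllib.request.urlopen(req).read())  # b'premium content'
```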
- cwmma 3 months agothis is basically what they are doing, but instead of charging actual money they are making visitors spin the CPU, ideally on a proof-of-work problem, which has the same outcome from the crawler's perspective.
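A sketch of how that proof-of-work idea operates (hypothetical challenge format, not any particular product's scheme): the server issues a challenge, the client must burn CPU finding a nonce whose hash clears a difficulty target, and the server verifies with a single hash.

```python
import hashlib
import itertools

def solve(challenge: str, difficulty_bits: int) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) falls below the
    target. Cheap for one interactive visitor, expensive when multiplied
    across millions of crawler requests."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Verification is a single hash -- the asymmetry is the whole point."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = solve("example-challenge", 16)  # ~65k hashes on average, milliseconds
assert verify("example-challenge", nonce, 16)
```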
- fewsats 3 months agoI've talked with tons of publishers and all say the same thing:
"Hey, we'd happily give these companies clean data if they just paid us instead of building these scrapers."
I think there is a psychological aspect that made micropayments never work for humans, but machines may be better suited for them.
- woah 3 months agoThis has existed for decades. The proof of CPU work is called "frontend frameworks"
- rambambram 3 months agoI stumbled upon this status code last year - had never heard of it before - and I bookmarked it and then forgot about it. Thanks for the reminder.
- SoftTalker 3 months agoThis is ultimately the answer. If something has value, users should pay for it. We haven't had a good way to do that on the web, so it has resulted in the complete shitshow that most websites are.
- fewsats 3 months agoThere's a real economic problem here: when someone scrapes your site, you're literally paying for them to use your stuff. That's messed up (and not sustainable)
It seems like a good fit for micropayments. They never took off with people but machines may be better suited for them.
L402 can help here.
- fewsats 3 months agoThe other obvious solution is a "web of trust" where Cloudflare just tells you "this request goes in, this one goes out".
I think the paying approach is superior (after all, you make money from people using your service), but Cloudflare's is a more straightforward/simpler one.
- tqwhite 3 months agoAren't you paying for me to use the site, too? Or Google? Isn't the point of paying for a web hosting service to distribute information?
- fewsats 3 months agoYes, but there is a "free lunch" problem. I can run a script that hits your page, costing you X, at a fraction of that cost to me (the user).
- tqwhite 3 months agoI think the whole internet is a free lunch problem as far as that goes. I pay for web hosting because I consider the cost to be worth it to send my fabulous opinions into the ether.
The premise of this thread is that somehow the LLM builders are reading too much. I bet it's less than google.
I continue to believe, if you don't want everyone in the world to see and use your stuff, don't put it on the internet.
- Aurornis 3 months agoRate limiting is the first step before cutting everything off behind forced logins.
> This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly
FYI Cloudflare has a very usable free tier that’s easy to set up. It’s not limited to large websites.
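As an illustration of the rate limiting mentioned above, here is a minimal per-client token bucket (a sketch with made-up parameters; in practice this usually lives in nginx, a WAF, or at the CDN edge rather than in application code):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow `rate` requests/second per client, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)  # per-client token count
        self.last = defaultdict(time.monotonic)      # last refill timestamp

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # refill proportionally to elapsed time, capped at capacity
        self.tokens[client_ip] = min(self.capacity,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False  # caller should answer 429 Too Many Requests
```

Each client gets an independent bucket, so a hammering crawler is throttled without affecting other visitors.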
- snerbles 3 months agoCloudflare also locks out non-Chrome/Firefox browsers, stifling the development of alternatives.
- blibble 3 months agoI get the feeling that I'm going to read a blog post in a few years telling us that the CDN companies have been selling everything pulled through their cache to the AI companies since 2022
- Aurornis 3 months agoCDNs are a cash cow. They’re not going to set their reputation on fire and violate all of their security guarantees for negligible amounts of money.
- littlestymaar 3 months agoWhat reputation?! Cloudflare has been known for its shady practices for more than a decade now, but people just don't care.
- AshamedCaptain 3 months agoI know a lot of companies that not only willingly send their most precious trade secrets (TM) freely to shady LLM operators (like OpenAI, Microsoft, etc.) , but they even pay for the privilege of doing it ...... just out of fear of "missing out" on this Next Big Thing.
- blibble 3 months agocloudflare continues to make a loss
meanwhile: "I'm proud of how our team continued to deliver ground-breaking innovation, especially in AI" (Matthew Prince, co-founder & CEO of Cloudflare)
- koakuma-chan 3 months agoCloudflare is free
- mystified5016 3 months agoSee absolutely every other sector of industry and economy for copious counter-examples.
If there's profit on the table, capitalism will not allow it to sit there at any cost.
- nottorp 3 months agoAnd even if they don't, is everything depending on Cloudflare to stay online a good thing?
- sshine 3 months agoIt’s a terrible thing.
Cloudflare is the company I hate the most: I think (what I know of) their tech is done right, but they're just too big for me to put my eggs in their basket.
- koakuma-chan 3 months agoWhy is nobody building a better product?
- zwnow 3 months agoUntil they threaten that they'll shut down your services unless you pay a huge bill. No thanks. Cloudflare has extremely questionable business practices.
- sshine 3 months agoCloudflare took down our website: https://news.ycombinator.com/item?id=40481808
A user running an online casino claimed that Cloudflare abruptly terminated their service after they refused to upgrade to a $10,000/month enterprise plan. The user alleged that Cloudflare failed to communicate the reasons clearly and deleted their account without warning.
Quote: "Cloudflare wanted them to use the BYOIP features of the enterprise plan, and did not want them on Cloudflare's IPs. The solution was to aggressively sell the Enterprise plan, and in a stunning failure of corporate communication, not tell the customer what the problem was at all."
——
Tell HN: Don't Use Cloudflare: https://news.ycombinator.com/item?id=31336515
Summary: A user shared their experience of being forced to upgrade to a $3,000/month plan after using 200-300TB of bandwidth on Cloudflare's business plan. They criticized Cloudflare's lack of transparency regarding bandwidth limits and aggressive sales tactics.
Quote: "A lot of this stuff wasn't communicated when we signed up for the business plan. There was no mention of limits, nor any contracts nor fineprint."
——
Tell HN: Impassable Cloudflare challenges are ruining my browsing experience: https://news.ycombinator.com/item?id=42577076
Summary: A user expressed frustration with Cloudflare's bot protection challenges, which made it difficult for them to unsubscribe from emails or access websites. They highlighted how these challenges disproportionately affect privacy-conscious users with non-standard browser configurations.
Quote: "The 'unsubscribe' button in Indeed's job notification emails leads me to an impassable Cloudflare challenge. That's a CAN-SPAM act violation."
- seec 3 months agoIt's modern racketeering.
If you don't need them, they'll make you think you need them (so they can monitor your needs) and when you do need them, they will extort you any way they can.
The vast majority of websites don't need Cloudflare; very often people use it because they run things in a very terrible way. Instead of paying Cloudflare extortion fees, pay competent people for proper infrastructure development.
- dougb5 3 months agoWhat exactly should be rate-limited, though? See the discussion here -- https://news.ycombinator.com/item?id=43422413 -- the traffic at issue in that case (and in one that I'm dealing with myself) is from a large number of IPs making no more than a single request each.
- layer8 3 months agoCentralizing large parts of the web behind Cloudfare is something to be feared as well.
- harha_ 3 months agoScrew Cloudflare, I'd rather host my own proxies.
- parliament32 3 months agoThe article this one links to mentions a project I found interesting for combating this problem: a (non-crypto) proof-of-work challenge for new visitors. https://github.com/TecharoHQ/anubis
Looks like the GNOME Gitlab instance implements it: https://gitlab.gnome.org/GNOME
- kh_hk 3 months agoFor targeted scrapes, isn't proof of work trivial to bypass?
1. headless browser
2. get cookie
3. use cookie on subsequent plain requests
- parliament32 3 months agoIt doesn't sound like the scrapers are that smart yet, but when they get there, presumably you'd just lower the cookie lifetime until the requests are down to an acceptable level. It takes a split-second in my browser so it shouldn't interfere much for human visitors.
- hubraumhugo 3 months agoWe should try separating good bots from bad bots:
Good bots: search engine crawlers that help users find relevant information. These bots have been around since the early days of the internet and generally follow established best practices like robots.txt and rate limits. AI agents like OpenAI's Operator or Anthropic's Computer Use probably also fit into that bucket, as they offer useful automation without negative side effects.
Bad bots: bots that negatively affect website owners by causing higher costs, spam, or downtime (automated account creation, ad fraud, or DDoS). AI crawlers fit into that bucket, as they disregard robots.txt and spoof user agents. They are creating a lot of headaches for developers responsible for maintaining heavily crawled sites. AI companies don't seem to care about any of the crawling best practices the industry has developed over the past two decades.
So the actual question is how good bots and humans can coexist on the web while we protect websites against abusive AI crawlers. It currently feels like an arms race without a winner.
- jsheard 3 months agoIdentifying search engine bots is pretty straightforward: the big names provide bulletproof methods to validate whether a client claiming to be their bot really is their bot. It'll be an uphill battle for new search engines if everyone only trusts Googlebot and Bingbot, though.
https://developers.google.com/search/docs/crawling-indexing/...
https://www.bing.com/webmasters/help/verifying-that-bingbot-...
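The procedure those pages document is a reverse-then-forward DNS check. A sketch (the domain suffixes below are illustrative; consult each engine's docs for the authoritative list):

```python
import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")  # illustrative

def hostname_is_allowed(hostname, suffixes=ALLOWED_SUFFIXES):
    """Does the reverse-DNS name sit under a trusted crawler domain?"""
    return hostname.rstrip(".").lower().endswith(suffixes)

def verify_crawler_ip(ip, suffixes=ALLOWED_SUFFIXES):
    """Reverse-DNS the client IP, check the domain, then forward-confirm:
    the hostname must resolve back to the same IP, which defeats spoofed
    PTR records. Requires live DNS, so errors are treated as 'not a bot'."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname_is_allowed(hostname, suffixes):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips
```

Anything that claims a Googlebot user agent but fails this check can be safely treated as an impostor.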
- kmeisthax 3 months ago> How long until scrapers start hammering Mastodon servers?
Mastodon has AUTHORIZED_FETCH and DISALLOW_UNAUTHENTICATED_API_ACCESS which would at least stop these very naive scrapers from getting any data. Smarter scrapers could actually pretend to speak enough ActivityPub to scrape servers, though.
- jmclnx 3 months agoI would think all you need to do is add a copyright statement of some kind.
Sad that things are getting to this point. Maybe I should add this to my site :)
(c) Copyright (my email), if used for any form of LLM processing, you must contact me and pay 1000USD per word from my site for each use.
- jcranmer 3 months agoThe argument the AI companies are making is that training for LLMs is fair use which means a copyright statement means fuck all from their point of view. (Even if it does, assuming you're in the US, unless you register the copyright with the US copyright office, you can only sue for actual damages, which means the cost of filing a lawsuit against them--not even litigating, just the court fee for saying "I have a lawsuit"--would be more expensive than anything you could recover. Even if you did register and sued for statutory damages, the cost of litigation would probably exceed the recovery you could expect.)
Of course, the big AI companies are already trying to get the government to codify AI training as fair use and sidestep the litigation which doesn't seem to be going entirely their way on this matter (cf. https://arstechnica.com/google/2025/03/google-agrees-with-op...).
- tsumnia 3 months agoIn addition, we need to start paying attention to the growing legislation about AI and copyright law. There was an article on HN, I think this week (or last), where a judge ruled that an AI cannot own the copyright on its generated materials.
IANAL, but I do wonder how this ruling will be used as a point of reference whenever we finally ask the question "Does material produced by GenAI violate copyright laws?" Specifically if it cannot claim ownership, a right that we've awarded to trees and monkeys, how does it operate within ownership laws?
And don't even get me ranting about HUMAN digital rights or Personified AIs.
- tqwhite 3 months agoFair use requires transformation. LLM is as transformative as it gets. If I'm on the jury, you're going to have to make new copyright law for me to convict.
I am personally happy to have everyone, people and LLM alike, learn from my wisdom.
- jcranmer 3 months ago> Fair use requires transformation.
No, it doesn't. There are four factors for fair use, and whether the use is transformative is part of one of them. And you don't need to win on all four factors.
> LLM is as transformative as it gets.
The current ruling precedent for "transformative" is the Warhol decision, which effectively says that to look at whether or not something is transformative, you kind of have to start by analyzing its impact on the market (and if you're going "doesn't that import the fourth factor into the first?" the answer is "yes, I don't like it, but it's what SCOTUS said"). By that definition, LLMs are nowhere near "transformative."
Even pre-Warhol, their role as "transformative" is sketchy, because you have to remember that this is using its legal definition, not its colloquial definition.
> If I'm on the jury
Fortunately, for this kind of question, the jury isn't going to be involved in determining fair use, so it doesn't matter what you think.
- Aurornis 3 months agoCopyright is for topics like redistribution of the source material. You can’t add arbitrary terms to a copyright claim that go beyond what copyright law supports.
I think you’re confusing copyright with a EULA. You would need users to agree to the EULA terms before viewing the material. You can’t hide contractual obligations in the footer of your website and call it copyright.
- 101008 3 months agoWhat if my index page says "These are the EULA terms; by clicking "Next" or "Enter", you are accepting them", and an LLM scraper "clicks" Next to fetch the rest of the content?
- aaronbaugher 3 months agoThat's how the big software companies have been doing it to us for years, so it does seem like turnabout would be fair play.
- jefftk 3 months agoIt's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license. This is what the https://githubcopilotlitigation.com/ class action (from 2022) is about, and it's still making its way through the courts. This prediction market has it at 12% likely to succeed, suggesting that courts will not agree with you: https://manifold.markets/JeffKaufman/will-the-github-copilot...
- jcranmer 3 months ago> It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license.
I would say it's not reasonably likely that LLM training is fair use. Because I've read the most recent SCOTUS decision on fair use (Warhol), and enough other decisions on fair use, to understand that the primary (and nearly only, in practice) factor is the effect on the market for the original. And AI companies seem to be going out of their way to emphasize that LLM training is only going to destroy the market for the originals, which weighs against fair use. Not to mention the existence of deals licensing content for LLM training which... basically concedes the point.
Of the various options, a ruling that LLM training is fair use I find the least likely. More likely is either that LLM training is not fair use, that LLM training is not infringing in the first place, or that the plaintiffs can't prove that the LLM infringed their work.
- tqwhite 3 months agoI do not read it that way at all. The Goldsmith decision mainly turns on the idea that an artist's protections include those for derivative works. Warhol produced a work that does substantially the same thing as Goldsmith's, i.e., is a picture that can be viewed.
When talking about parody, they note that the usage as the foundation for parody is always substantially different from the original and thereby allowed, even if it would otherwise infringe. LLMs are always substantially different from the original, too.
If I want to write software that draws that picture exactly, the code would not be a copyright violation. It is text and cannot be printed in a magazine as a picture. If I used it to print a picture that was a derivative work and sold that, it might be.
A large language model has no intersection with the picture or, for that matter, anything that it absorbs. It is possible that someone might figure out how to prompt it to do exactly the same picture as Goldsmith did but fairly unlikely.
Unless you could show that this was easy, common and part of the intent of the LLM creator, I can see no possibility that it is infringing.
- maeln 3 months ago> This prediction market has it at 12% likely to succeed
Randos on the internet with a betting addiction are distinctly different from a court of law. I wish people would stop talking about prediction markets as if they mattered.
- eudhxhdhsb32 3 months agoParticipants in prediction market do not need to be experts for their collective input to be informative.
There's a long history of economic research on the "wisdom of crowds" that backs up their value.
- dingnuts 3 months agothis isn't about copyright but about computer access. the CFAA is extremely broad; if you ban LLM companies from access on grounds of purpose you have every legal right to do so
in theory that legislation has teeth, too. they are not allowed to access your system if you say they are not; authentication is irrelevant.
every GET request to a system that doesn't permit access for training data is a felony
- waveringana 3 months agowhy are we pretending that these gambling sites have any weight on anything
- eudhxhdhsb32 3 months agoWhat do you mean by weights?
I'd certainly trust their predictions more than those given by most "experts".
- JohnFen 3 months agoSuch a notice is legally meaningless, though. Doubly so if the courts rule that scraping for AI purposes counts as fair use.
- kerkeslager 3 months agoThis is pretty naive.
The only reason copyright is so strong in the US is that there are big players (Disney, Elsevier) who benefit from it. But big tech is much bigger, and LLMs have created a situation where big tech has a vested interest in eroding copyright law. Both sides are gearing up for a war in the court systems, and it's definitely not a given who will win. But if you try to enter the fray as an individual or small company, you definitely aren't going to win.
- jasperr1 3 months agoThe reality is that a lot of these small websites have very permissive licenses. I really hope we don't get to the point where we must all make our licenses stricter.
- krapp 3 months agoThe reality is that none of these LLM scrapers give a damn about copyright, because the entire AI industry is built on flagrant copyright violation, and the premise that they can be stopped by a magic string is laughable.
You could sue, if you can afford it, meanwhile all of your data is already training their models.
- jasonjayr 3 months agoA class action, funded by their rivals could hurt quite a bit, especially for sites damaged monetarily by these LLM scrapers.
- jeffwask 3 months agoSure, because Meta certainly followed copyright law to the letter when they torrented thousands of copyrighted books from hundreds of published and known authors to train Llama. Forgive me if I doubt a text disclaimer on the page will slow them down.
- dspillett 3 months agoUnfortunately copyright is no limit to these companies.
Meta is stating in court that knowingly downloading pirated content is perfectly fine (ref https://news.ycombinator.com/item?id=43125840) so they for one would have absolutely no issue completely ignoring your copyright notice and stated licensing costs. Good luck affording a legal team to try force them to pay attention.
Copyright is something for them to beat us with, not the other way around, apparently.
- charcircuit 3 months agoCrawlers visiting every page on your website is not the main problem with the unauthenticated web.
The amount of spam that happens when you let people freely post is a much bigger problem.
- renegat0x0 3 months agoTo be honest, I feel that web2 is overrated.
Most content, like blogs, could be static sites.
For Mastodon and forums, I think user validation is OK and a good way to go.
- 0x1ceb00da 3 months agoDo I need to be worried about my bill if I've rented a simple EC2 instance without any fancy autoscaling stuff?
- simonw 3 months agoProbably not. Keep an eye on bandwidth usage since you'll be charged for that but you would need to attract an incredible amount of bot traffic for that to add up to anything meaningful.
The thing to watch out for is platforms like Vercel or Google Cloud Run where you get charged more for compute if you attract crawlers, potentially unbounded (make sure to set up spending limits if you can.)
- MontgomeryPy 3 months agoCould an answer here be for smaller websites to convert their sites into chatbots, which could prevent AI scrapers from slurping up all their content and driving up their hosting costs?
- cwmma 3 months agono
- napolux 3 months ago> I suggest everyone that uses cloud infrastructure for hosting set-up a billing limit to avoid an unexpected bill in case they're caught in the cross-hairs of a negligent company. All the abusers anonymize their usage at this point, so good luck trying to get compensated for damages.
This is scary
- jsheard 3 months agoWhat's scarier is that most of the big clouds don't even let you set up a billing limit.
- anovikov 3 months agoPretty soon virtually everything will be paywalled. Ironically, it will provide us with a good metric for finding out whether AGI has arrived or not: when it does, paywalling will stop working, because an AGI could derive more value from accessing things and would thus outbid us.
- woah 3 months agoIf you don't want someone to access your website, don't put it online
- isoprophlex 3 months agoEveryone is (rightfully) outraged, but this is essentially nothing new. Asshat capitalists have been externalizing the costs of their asshat moneymaking schemes on the little guy since approximately forever.
Deregulation is ultimately antithetical to our personal freedom.
I just hope the spirit of the internet that I grew up with can be rescued, or reincarnated somehow...
- ToucanLoucan 3 months agoYet another entry in the long and shameful history of Silicon Valley abusing the public square for their own profit (or in this case, fantasies of profit) and the rest of us just have to learn to live with it because the justice system simply will not even try and give us recourse.
Move fast and break things apparently has a bonus clause for the things you break not being your responsibility to fix.
- Analemma_ 3 months agoI don't think the justice system is the one to blame here. Right up until LLMs and their huge datamining operations appeared, everyone in tech was strongly for unrestricted scraping. Everybody here cheered the LinkedIn decision [0], saying "it's on the public web: if you didn't want it to be scraped, you should've put it behind authentication". LLMs change nothing about the legal landscape, they've just convinced everyone on an emotional level that unrestricted scraping is no longer an automatic good. It's not the justice system's job to react to such vibe shifts, the laws themselves have to be changed.
- ToucanLoucan 3 months agoI'm not talking about the ethics of scraping itself. I think scraping is fine from an ethics perspective for exactly that reason. I think LLM companies, however, are scraping ineptly and with poorly implemented tooling, which is causing problems for the websites they're targeting, and that sucks ass and they should be held liable.
On the legal end though, I do think there's a few things that should be done:
* Scrapers should be CLEARLY, and CORRECTLY identified as what they are, and who they are being dispatched from. Changing user agents to get around blocks should not be permitted, ever. If you only get a certain amount of content or a certain subset of pages when you identify as a scraper, that is a choice the website operator is making and it should be respected.
* Scrapers MUST OBEY robots.txt. We didn't create that for a fun hacker weekend. It's an important technical component of how we organize websites and how we want them crawled, if we want them crawled. It should be the first stop for any scraper on any website, and again, it should be respected.
* Scrapers should always meter their traffic with respect to the website owner. Pounding an entire website's library of content request after request with only milliseconds between is, to put it bluntly, being a fucking asshole. And not just to the owner, but to anyone else attempting to use the site at the time.
If a website operator configures their site incorrectly and pages they don't want scraped are, or pages they do want scraped aren't, then that is on them and they need to fix that. It is not in the scraper's purview to end-run around that configuration to "be real sure" they got everything they were meant to, and it's especially not that to get things the web operator has explicitly tried to not let the scraper have.
And yes, all of these things should be legally actionable, with financial penalties attached and for serial offenders, we should have a registry of scraper bots that we disallow entirely because they are acting in bad faith.
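The robots.txt compliance called for above is already trivial with standard tooling. A sketch using Python's stdlib parser (the rules and user agent names are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt for an example site
robots_txt = """\
User-agent: ExampleScraper
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant scraper checks before every fetch and honors the crawl delay.
assert parser.can_fetch("ExampleScraper", "https://example.com/articles/1")
assert not parser.can_fetch("ExampleScraper", "https://example.com/private/x")
print(parser.crawl_delay("ExampleScraper"))  # 10
```

In practice a polite crawler fetches /robots.txt once per host, caches it, and sleeps for the crawl delay between requests.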
- kerkeslager 3 months agoScraping is only part of the problem with LLMs. I don't care if you scrape my public data. The problem is re-publication, without even so much as attribution. LLMs should not be taking credit for my work.
- dannyobrien 3 months agoI feel like there's been a lot of assumptions going on, but not much testing. For instance, somebody has said that a lot of these bots are coming from Chinese IP ranges. Is that true? What percentage, vs. say Amazon regions? I would love more data!
- kerkeslager 3 months agoFrankly, I don't care.
I didn't give any LLM permission to train on my data, Chinese or otherwise. It's theft and I have zero recourse to do anything about it.
- eudhxhdhsb32 3 months agoIf you don't want others to use your data, perhaps you should have kept it private?
- JKCalhoun 3 months agoFor some reason I am not really moved by a lot of the hand wringing I am seeing lately.
It's a not a binary thing to me: LLMs are not god, but even without AGI, they have proven wildly useful to me. Calling them "shitty chat bots" doesn't sway me.
Further I have always assumed that everything that I post to the web is publicly accessible to everyone/everything. We lost any battle we thought we could wage some 2+ decades ago when web crawlers started hoovering up data from our sites.
- prosody 3 months agoThis article isn’t about that. It’s about the externalized costs that LLM companies are pushing onto webmasters because of their aggressive scraping. It’s one thing to believe that LLMs are a good thing, it’s another thing to believe that individuals and cooperative groups that run small internet services ought to be the ones to pay for that good.
- theamk 3 months agoIt's not about secret vs public, it's about resource overload on the websites. Existing crawlers so far mostly respected robots.txt, the LLM crawlers don't.
You, as a user, might not care, but as servers keep going down, more and more website owners start blocking LLMs. Good riddance, hopefully all good stuff gets locked down.
Or to use an analogy, your comment is similar to: "sure those delivery vans violate speed limits and occasionally hit the pedestrians. I don't care, those fast deliveries have been proven wildly useful to me"
- 101008 3 months agoI have published a lot of content about a particular topic online and I want it to be publicly accessible. A lot of people use it to create YouTube videos, and that's fine (a lot of them even cite me). I have a problem with LLMs profiting from it.
Which, I now realize, is not that different from people making YouTube videos. I feel there is a difference but I don't know how to explain it. Maybe there isn't. Ouch, writing this comment was not a good idea...
- chunky1994 3 months agoThis difference in emotional reaction is because of the effort involved in the process. Functionally, we see YouTube video creation as a fundamentally difficult exercise (to do well) and results in a singular product (one video). Any additional content would need an ongoing investment of time and money from the creator. The LLMs though would not require an ongoing investment beyond the first training run, that is probably why you have a problem with it, they're an extremely high leverage way of taking advantage of content.
- ergonaught 3 months agoYou are confronted with automation.
Individuals who have to do work in order to use your content to do work to create their own content is qualitatively different than automation trivially doing whatever.
- zoogeny 3 months agoWhat you are feeling is described in The Work of Art in the Age of Mechanical Reproduction [1]
1. https://en.wikipedia.org/wiki/The_Work_of_Art_in_the_Age_of_...
- gjsman-1000 3 months agoI agree even if I have mixed feelings on it.
To me this feels almost like the news complaining that they want a "link tax." Weren't their headlines and summaries used? It seems inconsistent to somehow say that AI and scraping is not okay; but that news companies should also not be entitled to their link tax. It's okay to index, but not that kind of index.
- theamk 3 months agoIt seems pretty cut-and-dry to me: the website owner should opt-out if they don't like the deal (being indexed in this case).
In the "link tax" case, there were plenty of trivial ways to opt out of headline usage - robots.txt, http headers, http tags. The problem was newspapers did not want to opt out (as they were benefiting from Google themselves), so they wanted a 3rd option. Which was pretty stupid of course - if you don't like the deal, don't take it; suing the offering party for a better deal is not a good long-term strategy.
In the AI case, there is no opt-out. All those websites already indicated they want to opt-out via robots.txt, but the AI companies ignore robots.txt, change user-agent, fake IPs, and so on - do the things that are normally done by shady malwar-ish services rather than multi-billion-dollar companies.
It really bothers me when people don't see the difference between those two cases.