ctippett 20 hours ago [-]
Am I correct that this has come about because archive.org respects robots.txt and these sites have blocked their crawler from indexing their sites?
I'm not sure how to articulate my thoughts on this exactly, other than to say it's disappointing that doing the right thing (i.e. respecting robots.txt) is rewarded with the burden of soliciting responses to a petition while at the same time others are rewarded with profit for ignoring those same directives.
Paracompact 20 hours ago [-]
Don't know if it helps your musings at all, but there's a good chance that if a high-profile crawler like archive.org disrespected their robots.txt, that archive.org would be faced with lawsuits (or some other form of pressure). This is not merely the most moral move; rather it is the only sensible move.
The only reason "others are rewarded with profit" in cases like these is that pinkie-promise-style obligations don't affect players too small or shadowy to bother litigating.
GolfPopper 19 hours ago [-]
>pinkie-promise-style obligations don't affect players too small or shadowy to bother litigating
I think you're looking at the wrong end of the spectrum there. It's some of the biggest players who flaunt the rules.
Fair point. Being small and shadowy is a sufficient condition to avoid litigation, but not a necessary one. Another sufficient condition is having billions of dollars to throw around. Unfortunately, archive.org is well known, well loved, and fundamentally harmless.
fragmede 5 hours ago [-]
> fundamentally harmless.
This is going to go in a boring direction with an argument thread that's been made since Internet time immemorial, and before. The argument goes: Pirating articles off nyt.com leads to lost sales of subscriptions, so it's not harmless. The response is, inevitably, no it doesn't, it leads to more sales. Or, people who weren't going to pay weren't going to pay anyway, so might as well give it to them for free, and be happy (as the NYT) for free advertising. And then the follow up, "No, it's a lost sale and journalism needs the money." HN is for thoughtful and substantive discussion, not for rehashing the same boring argument we've all read a thousand times. So my question isn't which camp is right. Both camps are firm in their beliefs. Copyright infringement is fine, copyright infringement is not. My question is in today's AI-fueled digital hellscape, how do we support journalists and the arts? If journalism only exists because eg Jeff Bezos pays for the Washington Post, we're going to get biased reporting (which has existed since long before the Internet); If art only exists because the artists come from rich families or have patrons like the Renaissance era, is society better off?
ryandrake 17 hours ago [-]
Side note: You probably mean "flout" instead of "flaunt."
Wowfunhappy 9 hours ago [-]
But AI companies don’t publicly redistribute the content they scrape, whereas Internet Archive does.
Even if you believe what the AI companies are doing is or should be a copyright violation, the Internet Archive is redistributing in a more direct manner.
cmeacham98 20 hours ago [-]
Correct. Example snippet from the nytimes.com robots.txt:
User-agent: archive.org_bot
Disallow: /
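For anyone who wants to check what the quoted rule does, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It assumes only the two-line excerpt above (the real nytimes.com file is longer), and the `can_fetch` helper and the example URL are made up for illustration. It shows how a compliant crawler would read the rule, not how any particular crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Only the two-line excerpt quoted above, not the full robots.txt.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.nytimes.com/"  # illustrative URL
    for agent in ("archive.org_bot", "ia_archiver", "SomeOtherBot"):
        print(agent, can_fetch(ROBOTS_TXT, agent, url))
```

With just this excerpt, only the `archive.org_bot` token is blocked; any other token, including `ia_archiver`, falls through to "allowed" because there is no matching group and no `*` group.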
mjmas 11 hours ago [-]
Is there a difference between that and User-agent: ia_archiver ?
Which they don’t respect. I’ve had it for my blog for years and they still added it to wayback machine, see my last comment for their official announcement of the ignore robots.txt policy, it is not new.
socalgal2 17 hours ago [-]
robots.txt means they shouldn't auto-scan your site. Any user though can go to the wayback machine and type in a URL and the wayback machine will read that URL. That was the intent of robots.txt (don't scan) not (don't read period). It's spelled out in the spec for robots.txt
I wonder how archive.org_bot behaves when <meta name="robots" content="noindex, noarchive, nocache" /> is present.
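I don't know what archive.org_bot does with that tag, but reading it is simple. A rough sketch, assuming a crawler that chooses to honor page-level directives; the class and function names are hypothetical and only Python's standard `html.parser` is used:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, noarchive, nocache" /></head></html>'
    print(sorted(robots_directives(page)))
```

Whether a given archiver then skips the page when it sees `noarchive` is a policy decision on the crawler's side; the tag itself enforces nothing.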
ninjagoo 11 hours ago [-]
> I’ve had it for my blog for years
Just out of curiosity, why don't you want your public blog archived? not questioning, just trying to understand the logic/motivations?
Also, I think you're being unfairly downvoted.
Gigachad 20 hours ago [-]
It's because they want to restrict AI companies from stealing content, but they can't do it if internet archive proxies it all for them.
All of the LLMs would be massively less useful if it wasn't for scraping the latest news.
stephen_g 19 hours ago [-]
LLMs have other ways of accessing the content, they don’t need the Web Archive.
Every LLM company can afford to spin up a new subscriber account every day, proxying to appear different IPs from all sorts of ASNs, do some crawling until the account gets banned, and then do it again, and again, and again.
overfeed 19 hours ago [-]
> LLMs have other ways of accessing the content, they don’t need the Web Archive.
What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.
Locking a door (or robots.txt) is how one can establish mens rea for those who bypass the barrier.
AnthonyMouse 16 hours ago [-]
This is like arguing that services can't provide access to libraries that provide public WiFi because it would give the public legal permission to pirate TV shows. They're two unrelated things. And then some members of the public argue that they're making fair use rather than pirating anything, but that still has nothing to do with the library.
stephen_g 15 hours ago [-]
But as I understand it, the Web Archive does respect robots.txt, while LLM scrapers absolutely do not and use all sorts of dodgy methods to get around it already...
The actual root cause is that we're allowing LLM companies to completely disregard copyright laws for their profit. Whether the LLM companies scrape the Web Archive or the original source doesn't change the copyright infringement implications in any way, and cutting off the web archive doesn't practically change anything (because as I understand, LLM scraping is already prolific all over the web).
Gigachad 18 hours ago [-]
The legal implications would be different vs scraping publicly available content.
AnthonyMouse 15 hours ago [-]
Is there a case that actually says this? Why would whether something is fair use depend on that? For that matter, how would they even show that a given AI model was trained on something from a recursive crawler rather than the same articles added to the training data after being downloaded by hand?
Gigachad 13 hours ago [-]
There was a similar case where a web scraper was bypassing prevention mechanisms on LinkedIn.
That case is why Twitter, and anyone else with lawyers paying attention went and put content behind a login wall.
AnthonyMouse 12 hours ago [-]
That case seems to imply the opposite?
switzer 11 hours ago [-]
LLMs would then license content from news orgs and other publishers, which is what should happen.
userbinator 18 hours ago [-]
"stealing" is BS because the original still exists. Copyright infringement is more correct.
jasonfarnon 16 hours ago [-]
they're stealing page views
Gigachad 18 hours ago [-]
You can call it whatever you want but it’s killing journalism when LLMs can automatically scrape and reword all the news. Sucking up the profits without contributing anything back to the people who created the work.
NeutralCrane 8 hours ago [-]
I don’t think many people are getting daily news from LLMs. Journalism has been dying since long before LLMs burst onto the scene as well.
There really isn’t even a defensible argument as to how this even should be illegal. The idea that someone can read words about a concept and then reword an explanation of that concept, and that this somehow violates the rights of the original author, is absurd.
The issue here and elsewhere isn’t LLMs. It’s that IP as a concept has always been a dystopic farce. Despite this we have not only kicked the can down the road on addressing this, we’ve doubled and tripled down and built our society around the concept. The advent of AI has simply blown the scale of the problem up to the point where it cannot be ignored any longer.
fragmede 4 hours ago [-]
> I don’t think many people are getting daily news from LLMs.
How many people do you think use LLMs in some fashion at all in their daily lives? Genuine question, I'm sure my personal experience is a biased sample, but so is everyone else's. Stats from AI companies aren't going to be (seen as) objective either. OpenAI and Anthropic are pushing a feature where I get a situation report at 9am like I'm an important official. With both labs pushing that, I think some people are getting their daily news from LLMs, the question is how many would it take for it to be meaningful, and how would we know if/when that bar gets crossed? What are the implications of that?
AnthonyMouse 12 hours ago [-]
The general problem here is that as soon as something is news, there will be not only numerous articles about it from multiple publications but also discussion of it on social media.
Which means LLMs have a zillion sources to get the story. Removing any given subset isn't going to prevent it from having the information in the training data, all it does is prevent that subset from being archived for future humans.
Aren't you choosing to ignore something very specific specified in that article? Why do you make it seem that article implies it's their overall policy?
> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org).
joecool1029 15 hours ago [-]
> Aren't you choosing to ignore something very specific specified in that article?
Of course not, did you ignore the lines right after? “As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.”
The announcement is from 9 years ago. I already mentioned they ignored the robots.txt for my own blog.
userbinator 18 hours ago [-]
It's the same idiocy that DRM created.
Be a pirate, because a pirate is free...
17 hours ago [-]
switzer 11 hours ago [-]
I think the problem is that when Archive.org has access to NYT and other publisher content, people can scrape NYT content at scale from Archive.org even when they cannot do so directly on NYT. If Archive.org blocks scrapers, maybe the publishers would make different choices and allow Archive.org access.
ajaimk 19 hours ago [-]
Idea: allow scraping but can’t publish for 1 year?
neop1x 6 hours ago [-]
That actually sounds like a cool idea! If only they would allow at least this.
JumpCrisscross 16 hours ago [-]
And find a litigation pool so the Archive can ensure LLMs crawling it contribute back.
tracker1 4 hours ago [-]
Tin hat on... I can't help but think that part of it is that they like being able to make stealth edits and pretend earlier versions of articles never existed.
someperson 22 hours ago [-]
Maybe they should have an escrow, like how the Financial Times is available on the NewsBank service with a 30-day escrow.
WarmWash 19 hours ago [-]
A bunch of people who haven't ever loaded an ad or paid a subscription to those organizations are going to make a stand to demand they leave their backdoor open?
JumpCrisscross 20 hours ago [-]
I know a little about this debate on the Times and Atlantic sides. I’ll get some grief for this, but I asked a senior person at the former what they thought about the paywall workarounds that are frequent on HN—I was genuinely shocked to learn they hadn’t heard about it.
In the end, we settled on agreeing that making such stuff available after 30 days, and possibly with access restrictions (can’t be pulled more than N times a day, in case it becomes relevant in the future) struck the right balance.
To my knowledge, the Internet Archive hasn’t done any outreach on this issue. In addition to pressuring the publications, I’d put some pressure on them to negotiate.
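To make the compromise above concrete, here is a toy sketch of a 30-day delay combined with a cap of N pulls per article per day. Everything in it is an assumption for illustration (the `EmbargoedArchive` class, the in-memory counter, the example URL); it is not how the Internet Archive or any publisher actually gates access.

```python
from collections import defaultdict
from datetime import date, timedelta

EMBARGO = timedelta(days=30)  # delay taken from the comment above


class EmbargoedArchive:
    """Serves an archived article only after the embargo, at most N times a day."""

    def __init__(self, max_pulls_per_day: int):
        self.max_pulls_per_day = max_pulls_per_day
        self._pulls = defaultdict(int)  # (url, day) -> pulls served that day

    def may_serve(self, url: str, published: date, today: date) -> bool:
        # Still inside the embargo window: never serve.
        if today - published < EMBARGO:
            return False
        key = (url, today)
        # Daily cap reached for this article.
        if self._pulls[key] >= self.max_pulls_per_day:
            return False
        self._pulls[key] += 1
        return True


if __name__ == "__main__":
    archive = EmbargoedArchive(max_pulls_per_day=2)
    url = "https://example.com/article"  # hypothetical article
    published = date(2025, 1, 1)
    print(archive.may_serve(url, published, date(2025, 1, 15)))  # inside embargo
    print(archive.may_serve(url, published, date(2025, 2, 1)))   # after embargo
```

A real deployment would need persistent counters and some notion of who is pulling, but the policy itself is only a few lines.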
jasonfarnon 16 hours ago [-]
This seems like a nice compromise. The news orgs get to keep the initial flurry of page views while the free information/universal library role of the Internet is maintained. But still those magazines will want to control their back catalogues. They currently sell access to libraries/universities. And as many on HN suggest, some of those news orgs would like to change/update stories without a publicly available "revision history".
JumpCrisscross 42 minutes ago [-]
> They currently sell access to libraries/universities. And as many on HN suggest, some of those news orgs would like to change/update stories without a publicly available "revision history"
Solution would be to restrict LLMs from training on the archives. Libraries and universities paying for back access is more complicated.
boomboomsubban 18 hours ago [-]
>about the paywall workarounds that are frequent on HN—I was genuinely shocked to learn they hadn’t heard about it.
Is the Internet Archive regularly used as a paywall workaround? Generally it's archive.is, which has no connection to the IA.
Wowfunhappy 10 hours ago [-]
It’s certainly not as common as archive.is, but I’ve seen it done on HN. I even commented at one point something to the effect of “this seems like a great way to get news sites to block the Internet Archive”…
10 hours ago [-]
tolerance 16 hours ago [-]
That's not the point.
boomboomsubban 16 hours ago [-]
Huh? IA not doing what they claim seems fairly important to their point.
frm88 16 hours ago [-]
Yes, he got that wrong. The IA doesn't remove paywalls.
16 hours ago [-]
themafia 19 hours ago [-]
> can’t be pulled more than N times a day, in case it becomes relevant in the future
In case it "becomes relevant." Wouldn't that benefit you either way? It makes you wonder if they have a dashboard of unfortunate digital statistics on display somewhere and worship of these numbers has replaced the underlying spirit of journalism.
JumpCrisscross 2 hours ago [-]
> Wouldn't that benefit you either way?
I’m not in journalism. Becoming relevant down the road would benefit the journal, but they’d also want to get the views in case it converts into subscriptions. Hence the pull limit.
> the underlying spirit of journalism
The places that ignore the business of journalism can’t fund the fight true journalism requires. To the extent I think there is a betrayal of fundamentals, it’s in the papers who went with free content driven by ads.
areoform 18 hours ago [-]
Not surprised. They're working from the wrong model for the wrong age with the wrong incentives. They're still acting like they live in a world where data and information is scarce; and they are the one true source of truth.
It's flipped right now. There's no single source of ground truth, but data and information are abundant. Yes, that abundance includes false data and lies, but it is still abundance.
The work The New York Times and The Atlantic do on their best days, i.e. the work of their investigative journalism teams, adds to this world, but they try to hide / cloister that work away even though the journalists themselves want to make it accessible.
In an ideal world, every child would learn how to read English via the NYT and The Atlantic, they'd grow up with these sources of record, learn from them, and watch the world through them. But the current model doesn't allow for that.
I think a patronage mixed with wikimedia-style foundation might be a better fit. Readers who love the institution and its mission are invited to pay as much as they want with scaling benefits (let's say you love the NYT so much that you want to give $10k/mo for their work, you should get commensurate access / get to ask questions). And these contributions flow into the endowment, which is invested and the outputs of that are distributed as a part of their operating budget.
I don't think classical journalism can survive an information abundant world without a patronage-based approach.
JumpCrisscross 16 hours ago [-]
> patronage mixed with wikimedia-style foundation might be a better fit
Maybe. The alternative is most people simply aren’t going to engage with long-form journalism. Keeping the analysis behind subscriptions while video summaries make ad revenue on YouTube and Twitter might be the best fit.
fragmede 4 hours ago [-]
Best fit for what? America's implementation of techno feudalism? How do we as a society get deep investigative journalists, like the ones who brought down Nixon, when that same society has the attention span befitting TikTok?
JumpCrisscross 2 hours ago [-]
> when that same society has the attention span befitting TikTok?
I think the point is all of society isn’t that. Some people still pay for proper journalism. Those who don’t want to don’t have to.
fragmede 1 hours ago [-]
If I don't have kids and thus don't want to pay for public schools, I still pay my taxes, which fund them, because society is better off well-educated, or at least literate.
armchairhacker 15 hours ago [-]
I would be glad if these “news” sites weren’t posted to HN at all. If the article is true and worth discussing, it will be reported by a more reputable organization (e.g. Reuters) or it’s a primary source that should be posted directly (sometimes the source is posted then a news article covering it is posted later, I don’t know why both aren’t merged).
Too often they’ve been caught selectively reporting details and quotes, or reporting facts from an unreliable source that turned out to be outright false. In the latter case they quietly retract the article, so most readers continue believing the lie (maybe that’s why they don’t want to be archived).
Even posting a small blog is better, while it can also be biased and untrustworthy, if it has original thought, supports an individual, and doesn’t have ads. Although the amount of obvious LLM blogs submitted here is another issue.
JumpCrisscross 15 hours ago [-]
> if the article is true and worth discussing, it will be reported by a more reputable organization (e.g. Reuters) or it’s a primary source that should be posted directly
The primary source of investigative journalism is the newspaper.
armchairhacker 15 hours ago [-]
Yes, but sometimes they paraphrase an article from a different news organization, and other times they’re not trustworthy.
If a NY Times article is corroborated or even paraphrased itself by a more trustworthy organization, or has direct links to multiple primary sources, I wouldn’t mind. Except the NY Times article is still paywalled, and there may be a source that’s not, in which case I still think that source should be submitted instead.
JumpCrisscross 13 hours ago [-]
Both should be submitted. I’m going to upvote the better source. Which more often than not, is the one that predominantly pays itself from subscribers versus ads.
Need a cryptographically verifiable internet archive. This is probably not possible without something like web 3 or nostr or gpg pgp. Idk.
armchairhacker 16 hours ago [-]
Many unrelated archives would be good enough
karel-3d 16 hours ago [-]
Can't the archive publish the SSL signatures of all the requests or something?
You can cryptographically verify a timestamp though by piggybacking on bitcoin like opentimestamps do.
grebc 18 hours ago [-]
[flagged]
eranation 19 hours ago [-]
I signed, but let’s be honest.
A pie chart showing the times I used the wayback machine to read an old NYT article vs the times I visited it due to a highly upvoted top HN comment linking to a relatively new article so we all can bypass the paywall is a solid circle.
gblargg 18 hours ago [-]
Would you have paid NYT to view the article if there weren't an archived copy? I doubt it.
glitcher 16 hours ago [-]
I would pay a small amount to read one article but I’m not going to subscribe. Who offers that?
Permit 16 hours ago [-]
Blendle, Scroll, Flattr and several others have attempted this. It turns out no consumer actually wants to do this, it’s primarily an idea that’s invoked on HackerNews to defend not subscribing to journalism while using ad blockers, it’s not a real business model.
suddenlybananas 13 hours ago [-]
How much do they charge per article? If it's above 10 cents or so, I can't imagine it being a reasonable price.
JumpCrisscross 16 hours ago [-]
> Would you have paid NYT to view the article if there weren't an archived copy?
That’s how I signed up to The Atlantic. I wanted to read the Signalgate reporting. There are other publications which get upvoted here frequently that have the paywall workarounds. I generally click around their paywall.
18 hours ago [-]
crowcroft 18 hours ago [-]
Ok, but what about Meta and X etc.
drivingmenuts 16 hours ago [-]
What is the advantage to those organizations to have their work preserved? If their work is stored in a public archive, they can’t charge for it and they lose money. If they make a mistake, then history is what they say it is and there is no external record to say otherwise.
JumpCrisscross 15 hours ago [-]
> What is the advantage to those organizations to have their work preserved?
It becomes a research resource. It also creates a high-friction interface for potential subscribers.
I wound up subscribing to Le Monde Diplo because of a HN comment referencing a paywalled article. I didn't want to sign up just for one article. So I bypassed using one of the circumvention sites (I think outline was popular then). The article was compelling enough that I signed up for the paper, and remain subscribed to this day.
keybored 13 hours ago [-]
Chomsky one time was talking about, gosh, his eyes are too old to be reading this microformat thing with a magnifier at the library in order to research archived newspapers like The (New York) Times. (This was sometime in the 90’s.)
m101 9 hours ago [-]
Isn't it true that the more people sign this petition, the stronger the NYT's case for denying archive.org access becomes (as many have said, most people only care because it allows them to circumvent paywalls)?
karel-3d 16 hours ago [-]
There is still archive.today, too bad the owner is crazy
shevy-java 16 hours ago [-]
We are kind of losing the world wide web here, or at least part of how we could use it in the past. More and more key services get knocked out; see the associated rise of age sniffing and the campaign to destroy VPNs.
sublinear 20 hours ago [-]
After many years of these media outlets circling the drain, this is likely the clearest signal of their irrelevance. It's not like anyone is committing these rags to microfiche anymore.
giwook 20 hours ago [-]
And by what standards have you determined that these outlets are circling the drain?
The work of independent journalists is more important than ever before.
beej71 20 hours ago [-]
More important than ever before and less market value than ever before. :(
weberer 10 hours ago [-]
>The work of independent journalists is more important than ever before.
Correct, and they're stealing viewers from these corporate media orgs like NYT. A lot of young people are getting their news from independent journalists on Youtube like Nick Shirley.
tstrimple 27 minutes ago [-]
This has to be a sick joke right? Nick Shirley is the farthest thing from an “independent journalist”. He’s just another right wing hack trying to stir up controversy. And it works because people who consume his content are looking for exactly what he is saying not because of any sort of truth.
What else you got? Alex Jones was an independent journalist? Rush Limbaugh? Tim Pool? Crazy.
awakeasleep 19 hours ago [-]
It’s kind of shocking to read what you wrote, and realize those big media brands used to be independent journalism.
18 hours ago [-]
monkaiju 20 hours ago [-]
Are we considering the NYT and USA Today "independent journalism" still? Seems dubious...
giwook 18 hours ago [-]
I don't know about USA Today. NYT at least seems independent if left leaning. I've not seen them be unfairly biased or bend over backwards to cater to outside corporate interests just yet. They're certainly not bending the knee to the current administration.
They have a robust paying subscriber base that supports them and don't have an owner whose last name rhymes with Pesos who can axe a story just because he doesn't like what it says.
rmunn 18 hours ago [-]
That a Democrat-leaning paper would criticize Republican politicians is not surprising. A better test of independence would be whether they criticize Democratic politicians (when they do things deserving criticism, that is: I don't expect them to criticize policy positions that they agree with, but all politicians do some things, in some cases many things, deserving of criticism).
giwook 9 hours ago [-]
The point is that they shouldn't be criticizing anyone which I think is the point of independent journalism.
That they publish articles that put Republicans in an unfavorable light is I think because Republicans are doing things that put themselves in an unfavorable light.
To your point, there have been at least a few articles I've seen that put Democrats in an unfavorable light as well.
And for what it's worth I consume news outlets that lean both ways. What's more important to me is factual accuracy.
ks2048 19 hours ago [-]
> the clearest signal of their irrelevance
NYT had $2.82B in revenue in 2025.
themafia 19 hours ago [-]
> It's not like anyone is committing these rags to microfiche anymore.
I recommend you actually go and read those fiches. The press was not historically high quality. Mass media has had the same problems for decades.
What it used to have was genuine independent competition.
kr108sdh 20 hours ago [-]
The petition should be to ban the AI theft. If it is on wayback, the bots could as well scrape the NYT directly.
The NYT is of course guilty itself. It did not investigate the possible murder of its star witness Suchir Balaji and is too reserved in examining the consequences of AI in general.
If they don't fulfill their journalistic and societal obligations, soon its own journalists will be replaced by AI bullet point slop like Axios.
WarmWash 19 hours ago [-]
Can we just go back to ads and normalize blocking people who ad-block?
I'm grown up now, I understand how things work, and I'd rather see Tide and Coke ads than pay $20/mo to 8 different orgs, while maintaining that ad free option for those who want it.
The children of the internet probably won't sign a truce, so let's just cut them out and let intellectually honest people have a decent internet.
goosejuice 19 hours ago [-]
I'm a paying NYT subscriber for years. NYT has a ton of ads, even for subscribers. They don't offer an ad free version despite it being totally viable at a few more bucks a month based on their finances. Their ads are super disruptive to reading and their privacy policy appears to indicate they buy and sell your data.
I dunno. That seems like a pretty big fuck you to a paying customer already when all they have to do is provide a sub for a few more bucks a month. But I guess I'm a child of the Internet.
vkou 12 hours ago [-]
Any customer who has the money to pay extra to skip ads is the most valuable customer for the ads to target.
shimman 19 hours ago [-]
How about we go back to the era of humanity where modern marketing didn't exist?
How much faster would consumer software be if adware was made illegal? How much faster would our devices be if we didn't have half the code base supporting malware?
Acting like an ad enabled internet was the only option is extremely foolish, especially when the ad enabled internet was fully chosen and pushed onto the public by very specific people (thanks Newt Gingrich!).
kmoser 19 hours ago [-]
> How about we go back to the era of humanity where modern marketing didn't exist?
That era vastly predates the Internet, let alone the (relatively) ad-free pre-1980s Internet, neither of which we can return to in any meaningful fashion.
elashri 19 hours ago [-]
> Can we just go back to ads and normalize blocking people who ad-block?
Nope, two problems
1- Ads are a privacy issue, not only a convenience issue. Targeted ads should not be normalized.
2- Companies have figured out that even paying doesn't mean you don't get ads. You are probably a bigger target, with more disposable income than average, in that case.
chadgpt3 19 hours ago [-]
We can't - LLMs don't proxy ads.
platevoltage 17 hours ago [-]
I'm fine with ads as long as they are integrated with the page. What I hate is the typical Google Adsense garbage where the same ad is plastered in 4 different places on the page, with a video ad playing in the corner, and if you're lucky, a popup ad as well.
32sGqt 19 hours ago [-]
[dead]
GolfPopper 19 hours ago [-]
>cut them out and let intellectually honest people have a decent internet.
Ah, so, take the money out of it completely? No subscriptions, and no ads? Sounds like a good idea to me.
Permit 15 hours ago [-]
Would you work for free?
LNSY 20 hours ago [-]
[flagged]
righthand 21 hours ago [-]
Wouldn’t it be better to let these legacy news orgs (which aren’t really anything beyond advertising and data harvesting firms) block archive.org and thus no one will read their articles and they can go under? I’m struggling to think of a reason I need NY Times. I’ve never had a subscription and never seen writing that I thought benefited me as a citizen (they’re Very pro-war of any kind).
JumpCrisscross 20 hours ago [-]
> block archive.org and thus no one will read their articles and they can go under?
…why would they go under if the people who don’t pay for news stop reading them?
sublinear 20 hours ago [-]
Media influence and authority has historically depended on getting cited by writing that is more directly relevant to the reader's concern (i.e. the topic of research).
The paywalls were one thing, but disallowing archival is practically suicide.
JumpCrisscross 16 hours ago [-]
> disallowing archival is practically suicide
The Times alone pulls a multiple of the Internet Archive’s visitors [1][2].
Yes and citations are a matter of quality, not quantity.
The whole point of archiving is so that people can review it later. People living in the future are the vast majority of readership (and no they didn't pay for it).
The article's place in historical context is far more important than the paper itself. Writing that stands the test of time and that gets cited frequently is where all the authority and credibility comes from. It's absurd that the NYT of all places can be this boneheaded, but I guess it's a sign of the times.
19 hours ago [-]
b00ty4breakfast 20 hours ago [-]
if people are reading the articles through wayback, then they aren't making any money because no data is harvested and no click-thrus or impressions or whatever the metric is are registered.
AnthonyMouse 16 hours ago [-]
People are willing to post links to paywalled articles when there are ways for people not currently inclined to subscribe to read them. Even if 97% of the current non-subscribers bypass the paywall, having 3% become subscribers is very useful, especially if they become recurring subscribers.
If posting the link instead implies that the 97% of people not currently willing to subscribe can't read it, then people instead post a link to a publication their audience can read, in which case the first publication gets actually 0%.
xyzzy_plugh 20 hours ago [-]
The title freaked me out. I thought this was about the Wayback Machine going away but no, it's just news publications blocking being archived.
I guess I don't really care. As soon as it becomes unworkable to view these publications through archivers I'll just stop viewing them altogether. I don't see this helping their bottom line though.
ameliaquining 20 hours ago [-]
As long as other people are reading them, they're important for understanding what's happening in the world and what information the public is getting, which is why we need an accessible archive of their content.
redwall_hp 20 hours ago [-]
Exactly. Libraries have kept microfiche archives of newspapers for forever, and they're an essential part of historical research.
They also preserved old books. But now I guess they're becoming middlemen for access to limited ebook platforms that ensure books disappear when publishers lose interest.
The "Information Age" is proving to be the setup for a dark age, when nonprofitable things are just thrown out and efforts to preserve them are actively fought.
layman51 20 hours ago [-]
I think part of this is important too because online news articles might have corrections, or certain paragraphs might get deleted in some rare situations. It's good to have a way of tracking those. Sometimes, the edits made to an article are very irrelevant to the actual message. I'm thinking stuff like typos, or even embarrassing gaffes like the recent time that a headline implied that the NATO acronym had the word "American" in it.
"Several AI companies said to be ignoring robots dot txt exclusion, scraping content without permission: report" (2024) https://www.tomshardware.com/tech-industry/artificial-intell...
Even if you believe what the AI companies are doing is or should be a copyright violation, the Internet Archive is redistributing in a more direct manner.
https://en.wikipedia.org/wiki/Alexa_Internet
I wonder how archive.org_bot behaves when <meta name="robots" content="noindex, noarchive, nocache" /> is present.
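For anyone curious what honoring that tag would look like on the crawler side, here is a minimal sketch using only the Python standard library. It says nothing about how archive.org_bot actually behaves; `RobotsMetaParser`, `robots_meta_directives`, and `may_archive` are hypothetical names for an imagined polite crawler that treats `noarchive` as a hard stop.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here
        # by HTMLParser's default handle_startendtag.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_meta_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_archive(html):
    """Hypothetical policy: refuse to store a copy if 'noarchive' is present."""
    return "noarchive" not in robots_meta_directives(html)
```

Whether a given crawler checks this tag at all is a policy decision on its side; the markup only expresses a request.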
Just out of curiosity, why don't you want your public blog archived? not questioning, just trying to understand the logic/motivations?
Also, I think you're being unfairly downvoted.
All of the LLMs would be massively less useful if it wasn't for scraping the latest news.
Every LLM company can afford to spin up a new subscriber account every day, proxying to appear as different IPs from all sorts of ASNs, do some crawling until the account gets banned, and then do it again, and again, and again.
What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.
Locking a door (or robots.txt) is how one can establish mens rea for those who bypass the barrier.
The actual root cause is that we're allowing LLM companies to completely disregard copyright laws for their profit. Whether the LLM companies scrape the Web Archive or the original source doesn't change the copyright infringement implications in any way, and cutting off the web archive doesn't practically change anything (because as I understand, LLM scraping is already prolific all over the web).
https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
There really isn’t even a defensible argument as to how this even should be illegal. The idea that someone can read words about a concept and then reword an explanation of that concept, and that this somehow violates the rights of the original author, is absurd.
The issue here and elsewhere isn’t LLMs. It’s that IP as a concept has always been a dystopic farce. Despite this we have not only kicked the can down the road on addressing this, we’ve doubled and tripled down and built our society around the concept. The advent of AI has simply blown the scale of the problem up to the point where it cannot be ignored any longer.
How many people do you think use LLMs in some fashion at all in their daily lives? Genuine question, I'm sure my personal experience is a biased sample, but so is everyone else's. Stats from AI companies aren't going to be (seen as) objective either. OpenAI and Anthropic are pushing a feature where I get a situation report at 9am like I'm an important official. With both labs pushing that, I think some people are getting their daily news from LLMs. The question is how many it would take for that to be meaningful, and how would we know if/when that bar gets crossed? What are the implications of that?
Which means LLMs have a zillion sources to get the story. Removing any given subset isn't going to prevent it from having the information in the training data, all it does is prevent that subset from being archived for future humans.
> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org).
Of course not, did you ignore the lines right after? “As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.”
The announcement is from 9 years ago. I already mentioned they ignored the robots.txt for my own blog.
Be a pirate, because a pirate is free...
In the end, we settled on agreeing that making such stuff available after 30 days, and possibly with access restrictions (can’t be pulled more than N times a day, in case it becomes relevant in the future) struck the right balance.
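The policy described above (embargo first, then capped access) is simple enough to sketch. This is only an illustration under stated assumptions: the 30 days comes from the comment, but the class name `EmbargoedArticle`, the per-day counter, and the choice to open access on day 30 exactly are all made up here, and N is left as a parameter because the comment does not give a value.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

EMBARGO_DAYS = 30  # from the comment above


@dataclass
class EmbargoedArticle:
    """Hypothetical archive record with an embargo and a daily pull limit."""

    published: date
    max_pulls_per_day: int  # the "N" from the comment; value is an assumption
    pulls_by_day: dict = field(default_factory=dict)

    def can_serve(self, today: date) -> bool:
        # Nothing is served until the embargo has elapsed.
        if today < self.published + timedelta(days=EMBARGO_DAYS):
            return False
        # After the embargo, cap the number of pulls per calendar day.
        return self.pulls_by_day.get(today, 0) < self.max_pulls_per_day

    def serve(self, today: date) -> bool:
        """Record one pull if allowed; return whether it was served."""
        if not self.can_serve(today):
            return False
        self.pulls_by_day[today] = self.pulls_by_day.get(today, 0) + 1
        return True
```

The counter resets implicitly because each day gets its own key, so a spike in interest gets throttled for that day only.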
To my knowledge, the Internet Archive hasn’t done any outreach on this issue. In addition to pressuring the publications, I’d put some pressure on them to negotiate.
Solution would be to restrict LLMs from training on the archives. Libraries and universities paying for back access is more complicated.
Is the Internet Archive regularly used as a paywall workaround? Generally it's archive.is, which has no connection to the IA.
In case it "becomes relevant." Wouldn't that benefit you either way? It makes you wonder if they have a dashboard of unfortunate digital statistics on display somewhere and worship of these numbers has replaced the underlying spirit of journalism.
I’m not in journalism. Becoming relevant down the road would benefit the journal, but they’d also want to get the views in case it converts into subscriptions. Hence the pull limit.
> the underlying spirit of journalism
The places that ignore the business of journalism can’t fund the fight true journalism requires. To the extent I think there is a betrayal of fundamentals, it’s in the papers who went with free content driven by ads.
It's flipped right now. There's no single source of ground truth, but data and information are abundant. Yes, that abundance includes false data and lies, but it is still abundance.
The work The New York Times and The Atlantic do on their best days, i.e. their investigative journalism, adds to this world, but they try to hide / cloister that work away even though the journalists themselves want to make it accessible.
In an ideal world, every child would learn how to read English via the NYT and The Atlantic, they'd grow up with these sources of record, learn from them, and watch the world through them. But the current model doesn't allow for that.
I think a patronage mixed with wikimedia-style foundation might be a better fit. Readers who love the institution and its mission are invited to pay as much as they want with scaling benefits (let's say you love the NYT so much that you want to give $10k/mo for their work, you should get commensurate access / get to ask questions). And these contributions flow into the endowment, which is invested and the outputs of that are distributed as a part of their operating budget.
I don't think classical journalism can survive an information abundant world without a patronage-based approach.
Maybe. The alternative is most people simply aren’t going to engage with long-form journalism. Keeping the analysis behind subscriptions while video summaries make ad revenue on YouTube and Twitter might be the best fit.
I think the point is all of society isn’t that. Some people still pay for proper journalism. Those who don’t want to don’t have to.
Too often they’ve been caught selectively reporting details and quotes, or reporting facts from an unreliable source that turned out to be outright false. In the latter case they quietly retract the article, so most readers continue believing the lie (maybe that’s why they don’t want to be archived).
Even posting a small blog is better, while it can also be biased and untrustworthy, if it has original thought, supports an individual, and doesn’t have ads. Although the amount of obvious LLM blogs submitted here is another issue.
The primary source of investigative journalism is the newspaper.
If a NY Times article is corroborated or even paraphrased itself by a more trustworthy organization, or has direct links to multiple primary sources, I wouldn’t mind. Except the NY Times article is still paywalled, and there may be a source that’s not, in which case I still think that source should be submitted instead.
You can cryptographically verify a timestamp though by piggybacking on bitcoin like opentimestamps do.
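The part of this that is easy to show is the commitment itself: what gets timestamped is a digest of the snapshot, not the snapshot. Below is a minimal sketch of that step only, assuming SHA-256 as the hash. Anchoring the digest in Bitcoin (the part OpenTimestamps handles) is not shown, and `snapshot_digest` and `matches_commitment` are hypothetical helper names, not part of any OpenTimestamps API.

```python
import hashlib


def snapshot_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest that would be submitted for timestamping."""
    return hashlib.sha256(content).hexdigest()


def matches_commitment(content: bytes, committed_digest: str) -> bool:
    """Check that a snapshot still hashes to a previously timestamped digest."""
    return snapshot_digest(content) == committed_digest.lower()
```

If an article is quietly edited after the digest was anchored, the stored copy no longer matches, which is exactly the property you want from an archive timestamp. It proves the content existed by that time, not who wrote it.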
A pie chart showing the times I used the wayback machine to read an old NYT article vs the times I visited it due to a highly upvoted top HN comment linking to a relatively new article so we all can bypass the paywall is a solid circle.
That’s how I signed up to The Atlantic. I wanted to read the Signalgate reporting. There are other publications which get upvoted here frequently that have the paywall workarounds. I generally click around their paywall.
It becomes a research resource. It also creates a high-friction interface for potential subscribers.
I wound up subscribing to Le Monde Diplo because of a HN comment referencing a paywalled article. I didn't want to sign up just for one article. So I bypassed using one of the circumvention sites (I think outline was popular then). The article was compelling enough that I signed up for the paper, and remain subscribed to this day.
The work of independent journalists is more important than ever before.
Correct, and they're stealing viewers from these corporate media orgs like NYT. A lot of young people are getting their news from independent journalists on YouTube like Nick Shirley.
What else you got? Alex Jones was an independent journalist? Rush Limbaugh? Tim Pool? Crazy.
They have a robust paying subscriber base that supports them and don't have an owner whose last name rhymes with Pesos who can axe a story just because he doesn't like what it says.
That they publish articles that put Republicans in an unfavorable light is I think because Republicans are doing things that put themselves in an unfavorable light.
To your point, there have been at least a few articles I've seen that put Democrats in an unfavorable light as well.
And for what it's worth I consume news outlets that lean both ways. What's more important to me is factual accuracy.
NYT had $2.82B in revenue in 2025.
I recommend you actually go and read those fiches. The press was not historically high quality. Mass media has had the same problems for decades.
What it used to have was genuine independent competition.
The NYT is of course guilty itself. It did not investigate the possible murder of its star witness Suchir Balaji and is too reserved in examining the consequences of AI in general.
If it doesn't fulfill its journalistic and societal obligations, soon its own journalists will be replaced by AI bullet point slop like Axios.
I'm grown up now, I understand how things work, and I'd rather see Tide and Coke ads than pay $20/mo to 8 different orgs, while maintaining that ad free option for those who want it.
The children of the internet probably won't sign a truce, so let's just cut them out and let intellectually honest people have a decent internet.
I dunno. That seems like a pretty big fuck you to a paying customer already when all they have to do is provide a sub for a few more bucks a month. But I guess I'm a child of the Internet.
How much faster would consumer software be if adware was made illegal? How much faster would our devices be if we didn't have half the code base supporting malware?
Acting like an ad enabled internet was the only option is extremely foolish, especially when the ad enabled internet was fully chosen and pushed onto the public by very specific people (thanks Newt Gingrich!).
That era vastly predates the Internet, let alone the (relatively) ad-free pre-1980s Internet, neither of which we can return to in any meaningful fashion.
Nope, two problems
1- Ads are a privacy issue, not only a convenience issue. Targeted ads should not be normalized.
2- Companies have figured out that even paying doesn't mean you don't get ads. You're probably a bigger target, with more disposable income than average, in that case.
Ah, so, take the money out of it completely? No subscriptions, and no ads? Sounds like a good idea to me.
…why would they go under if the people who don’t pay for news stop reading them?
The paywalls were one thing, but disallowing archival is practically suicide.
[1] https://www.semrush.com/website/archive.org/overview/
[2] https://www.semrush.com/website/nytimes.com/overview/