(huggingface.co)
The bigger concern is how large the git history is going to get on the repository.
This is likely to be lower traffic, and the history should (?) scale only linearly with new data, so likely not the worst thing. But it's something to be cognizant of when using SCM software in unexpected ways!
So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.
There is also flexibility in what you define as the dataset. Skinnier, but more focused tables could be space saving vs a wide table that covers everything -will probably break compressible runs of data.
Perhaps you could use a local translation model to rephrase (such as TranslateGemma). If translating English to English doesn't achieve this effect then use an intermediate language, one the model is good at to not mangle meaning too much.
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
So to get all the data you need to grab the archive and all the 5 minute update files.
archive data is here https://huggingface.co/datasets/open-index/hacker-news/tree/...
update files are here (I know that its called "today" but it actually includes all the update files which span multiple days at this point) https://huggingface.co/datasets/open-index/hacker-news/tree/...
probably uncalled for
they are suggesting that the huggingface description should be automatically updating the date & item count when the data gets updated.
Wouldn't that lose deleted/moderated comments?
deleted and dead are integers. They are stored as 0/1 rather than booleans.
Is there a technical reason to do this? You have the type right there.Your project actually helps me out a ton in making one of the new project ideas that I had about hackernews that I had put into the back-burner.
I had thought of making a ping website where people can just @Username and a service which can detect it and then send mail to said username if the username has signed up to the service (similar to a service run by someone from HN community which mails you everytime someone responds to your thread directly, but this time in a sort of ping)
[The previous idea came as I tried to ping someone to show them something relevant and thought that wait a minute, something like ping which mails might be interesting and then tried to see if I can use algolia or any service to hook things up but not many/any service made much sense back then sadly so I had the idea in back of my mind but this service sort of solves it by having it being updated every 5 minutes]
Your 5 minute updates really make it possible. I will look what I can do with that in some days but I am seeing some discrepancy in the 5 minute update as last seems to be 16 march in the readme so I would love to know more about if its being updated every 5 minutes because it truly feels phenomenal if true and its exciting to think of some new possibilities unlocked with it.
Comments+posts are defined as user generated content, you have no right to its privacy/control in any capacity once you post it - https://www.ycombinator.com/legal/
YC in theory has the right to go after unauthorized 3rd parties scraping this data. YC funds startups and is deeply vested in the AI space. Why on Earth would they do that.
Copyright doesn't seem to matter unless you're an IP cartel or mega cap.
https://www.ycombinator.com/legal/
Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)
The user content is supposed to be licensed only Y Combinator and (bleah) its affiliated companies (which are many, all the startups they fund, for example).
If it's owned by you and only licensed by HN shouldn't you be the one enforcing it?
> ... a nonexclusive
I.e. this section is talking to additional rights to the content you post to ALSO go to YC, not that YC is guaranteeing it (+friends) will be the only one to hold these rights or will enforce who else should hold the rights to your publicly shared content for you.
There's a more intricate conversation to be had with GDPR and public data on forums in general but that's wholly unrelated to what YC's legal page says and still unlikely to end up in an alarming result.
Imagine Facebook claiming that by uploading images you are granting them exclusive usage rights to that image. It would mean you couldn't upload it to any other site with similar terms anymore.
That said, there are "no scraping" and "commercial use restricted" carve-outs for the content on HN. Which honestly is bullshit.
Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.
Other Users: certain actions you take may be visible to other users of the Services.
Then again, I'm not the guy that is going to get sued...
I agree. It's the owners of the sites that have to follow rules, not us.
And that, my friends, is how you kill the commons - by ignoring the social context surrounding its maintenance and insisting upon the most punitive ways of avoiding abuse.
Grass and property require upkeep. Radio waves and electromagnetic radiation do not.
I don't want your dog to piss on my lawn and kill my grass. But what harm does it cause me if you take a picture of my lawn? Or if I take a picture of your dog?
If I spend $100M making a Hollywood movie - pay employees, vendors, taxes - contribute to the economic growth of the country - and then that product gets stolen and given away completely for free without being able to see upside, that's a little bit different.
But my Hacker News comment? It's not money.
I think there are plausible ways to draw lines that protect genuine work, effort, and economics while allowing society and innovation to benefit from the commons.
I know, because I've been here since maybe 2015 or so, but this account was created in 2019.
So any PII you have mentioned in your comments is permanent on Hacker News.
I would appreciate it if they gave users the ability to remove all of their personal data, but in correspondence and in writing here on Hacker News itself, Dan has suggested that they value the posterity of conversations over the law.