by eqvinox 1 day ago

by tgv 22 hours ago

> Germans (because of course)

I don't know if it's the reason you imply. In the 70s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reflection on WW2 (Vergangenheitsbewältigung).

by xenocratus 22 hours ago

I took the "because of course" to be about having a word for everything - a stereotypical idea about the German language.

by greycol 16 hours ago

My understanding was that it was more that words can be concatenated into new words in German, which is not so much a stereotype as a misunderstanding of fact. I.e., you wouldn't think much about something like enjoyable-comeuppance, but schadenfreude looks more impressive without the hyphen.

by gf000 3 hours ago

I would argue it's not the exact same thing. Sure, when overdone you would get the same. But as it is, commonly used concatenated words are words, not just hyphenated words. They are used as words, and without extra thought people don't parse them into separate parts, unlike what they do with a list of words joined by hyphens.

E.g. you don't think of firefighter as fire-fighter in ordinary usage.

by dragontamer 22 hours ago

There's also the other implication that the (East) Germans were Soviet just 35 years ago.

But yes. We Americans know Germans more for their silly big words. But statements like that can be misinterpreted, since the Germans' view of themselves doesn't quite match the American stereotypes.

by eqvinox 19 hours ago

I was implying all 3 of the above:

- we learned the hard way that data will be used to kill people, during the Nazi regime

- we learned it again in the GDR with the Stasi being a little less obvious but still ruining people's livelihoods

- and German comes up with compound words for such things

by microtonal 20 hours ago

East Germany was not Soviet. Under influence/control of the Soviets, yes, but not part of the Soviet Union.

by 19 hours ago

deleted

by yreg 16 hours ago

That's like saying that English (because of course) is able to describe the concept by a combination of words.

by theptip 22 hours ago

The Stasi would be the obvious cultural context.

In the US of course the government buys this sort of information legally from corporations.

by Swizec 22 hours ago

> The Stasi would be the obvious cultural context.

There is also the rather famous example of how earlier census data was used in the 40’s.

Once the government has your data, they have it. The next generation of representatives may not follow all the same rules and norms.

by RobotToaster 22 hours ago

The Stasi could only dream of the kind of surveillance the NSA et al. have today.

by throwaway173738 9 hours ago

Or Facebook or Equifax.

by tgv 19 hours ago

The West-German debate in the 70s came from the realization that the sheer size of the Holocaust/Shoah was in no small degree due to bureaucratic record keeping. Storing someone's ethnicity is potentially dangerous for that person.

by Centigonal 21 hours ago

Germany resisted Google Street View until 2023, which was something I thought was very impressive.

by mrsvanwinkle 22 hours ago

Love it, also love how Datenschatten can also imply that it disappears when someone shines light on it

by reactordev 22 hours ago

If only our past 20 year old self data could be so ephemeral…

Who doesn’t want that old post going extinct forever when they were shit faced outside of a bar in Nashville but now they are in their mid-life and are “respectable” members of society.

by cyanydeez 20 hours ago

Yeah, so Germany had a ton of secret police files and of course learned very well what happens when a bunch of people start collecting dossiers.

So yeah, of course they've developed that type of distrust. Americans should have too, after the 50s-60s paranoia of the Red Scare, black people etc. Instead they just spent a few decades building an anti-social state.

by wlesieutre 23 hours ago

I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”

by hdndjsbbs 21 hours ago

10+ years ago companies were hoovering up data for ML - trying to find correlations in high-dimensionality data. Mostly the results were garbage but occasionally you hit on a real, unexpected phenomenon.

Nowadays you just throw all the data into a black box and believe whatever it says blindly.

by CincinnatiMan 23 hours ago

Were you not around for the Big Data heyday a decade ago?

by varispeed 23 hours ago

Once thumb drives became large enough to fit most datasets, it stopped being Big Data. Just normal data.

by ffsm8 22 hours ago

We have thumb drives that can store petabytes of data?

Or did you mean the "big data" crowd which thought 500GB was noteworthy? I don't think anyone took those seriously, neither in the 2010s nor now. That was always "small" data.

by 0x457 20 hours ago

My rule of thumb was "can it fit in RAM on a server?" If it can, then it's not big data.

500GB is in the "fits" category.

by gf000 3 hours ago

You could quadruple that and it would still fit in server RAM.

by butlike 22 hours ago

> We have thumb drives that can store petabytes of data

We do?

by dylan604 21 hours ago

It was a question; you've edited out the punctuation. You're asking the exact same thing as the person you replied to.

by ffsm8 22 hours ago

Please provide a link.

by BizarroLand 21 hours ago

You would need 4 and change of these 245 TB Kioxias to hold 1 petabyte, and an entire server-grade computer to run them.

https://www.tomshardware.com/pc-components/ssds/kioxia-unvei...

Or 250 of these ~$400 4 TB flash drives and an insane number of dongles to connect them all:

https://www.slashgear.com/1847725/largest-usb-thumb-drive-hi...
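
A quick sanity check of the arithmetic above, as a rough sketch only. It assumes decimal units (1 PB = 1000 TB, the way drive vendors advertise capacity) and ignores formatting overhead and redundancy; `drives_needed` is a made-up helper name, not anything from the linked articles.

```python
import math

TB_PER_PB = 1000  # assumption: decimal units, as on drive spec sheets


def drives_needed(target_tb: float, drive_tb: float) -> int:
    """Smallest whole number of drives whose combined capacity covers target_tb."""
    return math.ceil(target_tb / drive_tb)


# 1000 / 245 is roughly 4.08 ("4 and change"), so 5 whole drives are needed.
ssd_count = drives_needed(TB_PER_PB, 245)

# 1000 / 4 is exactly 250, matching the flash drive count quoted above.
usb_count = drives_needed(TB_PER_PB, 4)

print(ssd_count, usb_count)
```

Rounding up matters for the SSD case: four drives fall about 20 TB short of a petabyte, so in practice you would buy five.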

by vunuxodo 13 hours ago

Plus one more for your parity drive.

by varispeed 22 hours ago

Most companies using the term "big data" had datasets in the TB range. One company I had a gig at had a full Hadoop cluster set up, and their whole dataset was 40GB. Their marketing had all the big-data-adjacent keywords all over the brochures for clients.

by gf000 3 hours ago

That's a decent-quality 3-hour movie :D

by jmalicki 22 hours ago

To some degree IMO big data is still a mindset when it might take a day to process your data in a normal SQL query. Some tech doesn't scale to the data size for all use cases, and you need different solutions.

by ToucanLoucan 22 hours ago

Hell you mean a decade ago? I still see businesses running losses left right and center saying that they're gonna monetize user data, any day now.

Relatedly, "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.

by citrin_ru 23 hours ago

Data hoarding predates LLMs. There were other machine learning methods which also needed data for training.

by Forgeties79 23 hours ago

“Before LLMs there was _____”

I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.

Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.

by dpoloncsak 23 hours ago

Do LLMs require that much more data than the traditional ML approaches we've seen over the years?

by sigmoid10 22 hours ago

Yes. This is pretty well established. Neural networks in general are considerably less sample-efficient than traditional ML methods. The reason they became so successful is that they scale better as you increase training data and model size. But only with modern compute power did they become useful outside of academic toy model applications.

by Forgeties79 21 hours ago

That’s not the issue I’m hitting here primarily but yes.

My concern is that I can open up ChatGPT and, even with a free, “anonymous” account, run an assembly line generating tens of thousands of words a day to pump to Twitter that are good enough to prop up multiple fake accounts and cause mayhem.

Now make it thousands of people like me doing it. Now add funding and political orgs. Add company leadership that turns a blind eye so long as it drives engagement. This scale and pipeline wasn’t possible 5 years ago, even if we clearly see the throughline.

I’m not even getting into fake images either. That used to require some know-how. Now there are basically no hurdles, and even if most people learn it’s fake, millions likely won’t. If you’re a little lucky, less scrupulous “news” outlets will amplify it for you as well, for free.

by b00ty4breakfast 22 hours ago

I really hate this when it's something negative that humans also do. It's like, yeah, people do do that, but why are we automating {negativeTrait}?

by Forgeties79 15 hours ago

Unfortunately the answer is usually people just want to hand wave away the critique for one reason or another. “People already do that” is an easy truism for stifling discussion.

by ToucanLoucan 22 hours ago

> Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.

I have the faintest possible hope that such things are going to be the death knell of social media. Yeah a lot of credulous idiots are happily giving AI thirst traps their money for stroking their confirmation bias, but that's just who's left at this point. It feels like every social media app I use is gradually bleeding users who aren't hopelessly addicted to the dopamine treadmill, because what's left is just plain unappealing to them, which selects for the people who are most vulnerable to AI shit, which is far from ideal, but also means those platforms are comprised ever more of that vulnerable population and nobody else. And the problem with all these businesses going through that is without a diverse, growing audience, you just become InfoWars, slinging the same slop to the same people every day, and every ounce of said slop is great for what's left of your audience, but absolute garbage for getting anyone new in it. And it just goes on that way until you sputter out and die (or harass the wrong group of parents I guess).

I wish all social media sites a very haha die in a fire.

by dpoloncsak 21 hours ago

Mate, you're on a social media site right now that often has AI-generated content displayed at the top of what's "trending". Sure, the general user base does a better job here flagging that sort of stuff, as AI seems to be a shared interest in much of the community, but it still sneaks its way by.

by Forgeties79 17 hours ago

You’re technically right but I think we can all agree HN is significantly different from the major players. The vast majority of us see the same posts and comments, for starters. The churn of posts is also much slower. You log on 2-3 times spread out in a day and you see 90% of the main posts. Top posts linger for 24-48hrs regularly.

No media uploading, memes are few and far between (usually punished), etc.

by dnnddidiej 5 hours ago

Do Germans have lots of words or just a lack of spaces?

by dhosek 10 hours ago

Or you could put it in a box with no connection to the internet.

Introducing… The Hooli Box!

by littlecranky67 22 hours ago

Data can never be stolen, because it is not a physical thing. Data can be copied, and it can be erased; sometimes both happen at the same time. Data can be lost, that is, when its last existing copy is erased.

by Peritract 22 hours ago

The use of "steal" for non-physical things [0] pre-dates the use of "data" in the modern sense [1]. Policing language incorrectly is not reasonable.

[0] https://www.opensourceshakespeare.org/views/plays/play_view....

[1] https://www.etymonline.com/word/data

by altruios 22 hours ago

Pedantic and true. What was stolen was not data, but future revenue based on exclusive access to that data.

by gblargg 4 hours ago

Pedantic and relevant. If they lost the voice samples, they wouldn't have them for training new models. If the samples were copied, then they have lost nothing in terms of training.

by dnnddidiej 5 hours ago

Money is not a physical thing.

by b00ty4breakfast 21 hours ago

[dead]

by hiccuphippo 22 hours ago

Data that is publicly available also can't be stolen or leaked. Nobody can steal Mozilla's Common Voice dataset.

by elevation 21 hours ago

> The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.

Except no company is learning this lesson.

The enterprise threat model includes "our own users", and the modus operandi is to maintain as much information on that threat as possible.

by coolkewlcuil 21 hours ago

The only winning move is not to play.

by __alexs 21 hours ago

Seems a bit like blaming the victim? Your voice (like DNA) is kind of ambient data that's hard to hide.
