The whole "hosted AI" business feels like a huge violation of corporate norms on confidentiality. Businesses that would have your head for printing out a source file to reference and annotate are encouraging developers to feed in huge amounts of proprietary code and data, and to incorporate changes suggested by an outside party with minimal vetting. Evidently whatever privacy policies they've been throwing at enterprise users are plated with mithril.
At some point, one of the big services is going to get popped, and it won't just be a data breach. There's too much opportunity to quietly use the system as a malware distribution hub. Every vibe-coded dashboard suddenly starts depending on some weird left-pad fork that, 12 dependencies deep, is running a keylogger or Dogecoin miner. Your payment processor suddenly starts accepting the Konami code to approve a transaction.
A local model you trained yourself seems about as good as you can do today.
But it may not even be possible to fully trust a model you trained if you used untrusted data during training.
As a user, you have to trust your coding agent AND inference provider AND models: https://jacob.gold/posts/coding-models-are-code/ https://www.anthropic.com/research/sleeper-agents-training-d...
It's unfathomable to me that EU companies don't take the risk of industrial espionage from the US more seriously.
Of course those are largely the same companies that receive emails via Outlook, manage company-wide SSO in Microsoft Entra, put their files in SharePoint and track software and maintenance issues in Jira ... I'm not sure how much info there is left that isn't already combed through by the NSA and friends.
There might be some valid concerns about model alignment, but at least the model running in-house isn't going to conduct espionage.
This only really matters in a world where prompt injection and jailbreaking aren't trivial in the first place, though. All current models are still extremely exploitable.
I strongly suspect we are only scratching the surface of activation engineering at the moment, and there are plenty of very targeted ways of lobotomizing or cracking LLMs if you understand the model in detail.
Not impossible, I agree, but it seems like a really impractical way to ship a trojan while much weaker channels exist.
If a token compresses to around a byte, worldwide AI input and output is around 1 gigabyte per second.
Any intelligence agency can afford to store all of that forever and do analysis on it later.
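For a rough sense of scale, here's the back-of-envelope arithmetic, taking the 1 GB/s figure above at face value. The rate is the estimate from this comment, not a measurement, and the decimal units and 365-day year are my own assumptions:

```python
# Back-of-envelope storage estimate.
# Assumption (from the comment above, not measured): ~1 GB/s of worldwide AI traffic.
BYTES_PER_SECOND = 1_000_000_000

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: ignore leap years


def stored_bytes(days: int, rate: int = BYTES_PER_SECOND) -> int:
    """Total bytes accumulated after `days` at a constant `rate` in bytes/second."""
    return rate * SECONDS_PER_DAY * days


# Decimal units: 1 TB = 1e12 bytes, 1 PB = 1e15 bytes.
per_day_tb = stored_bytes(1) / 1e12
per_year_pb = stored_bytes(DAYS_PER_YEAR) / 1e15

print(f"per day:  {per_day_tb:.1f} TB")
print(f"per year: {per_year_pb:.1f} PB")
```

Under those assumptions that is roughly 86 TB a day and about 32 PB a year of raw text, before any further compression or deduplication.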
At the scale the AI companies are operating at, I think it isn't likely that they are sucking it all in right now.
More likely, I think, the intelligence agencies will get a real-time tap into the raw data feed, process it onsite for interesting things, and log only what gets flagged in their own systems.
That's why you should use abliterated heretic models.
My favorite conspiracy is that three-letter agencies keep pushing the conspiracy that they are omnipresent with access to everything. Same as parents telling their kids Santa is watching, and leaders telling adults God is watching. It's extremely effective control and millennia-old at this point.
The reality is much more banal: they still need warrants, and tech companies hate playing police/evidence servant for the government (it consumes a ton of resources and pays nothing).
The Snowden leaks revealed that's not the case.
The three-letter agencies can just issue national security letters without a judge ever seeing them, and those come along with a gag order (plus other workarounds like just buying data from brokers, and how US communications can get swept up just by virtue of communicating with a foreign national outside the US).
You're right, they aren't omniscient in the way we imagine, with a room full of people monitoring everything in real time. But to pretend they aren't passively collecting massive amounts of data is dangerous. Snowden showed us PRISM, with all major tech companies participating. They do effectively have a live, unrestricted wiretap to the internet, and if you happen to be a person of interest, they will just send out NSLs and get all your communications that are not fully E2EE without you even knowing, thanks to the gag order.
I'll provide some helper information to get the ball rolling (see page 42)[1]
[1]https://www.intelligence.gov/assets/documents/702-documents/...
All the other prime suspects are in the report too for the curious.
I will not elaborate how I know, but that is not even directionally correct. And these aren’t even secrets; they can be learned simply from the Snowden, WikiLeaks, and Vault 7 releases. So why are you telling yourself this? Are you still wet behind the ears or something?
There are people who know exactly how governments do not in fact need warrants, and the tech companies don’t even really know they are servants to the government, let alone which one. That’s how things are done. The less surface area the better.