I don't normally do the whole "abliterated" thing (dealignment), but after discovering https://github.com/p-e-w/heretic I was too tempted not to try it with this model a couple of days ago (I made a repo to make it easier, actually: https://github.com/pmarreck/gemma4-heretical) and... wow. It worked. And not having a built-in nanny is fun!
It's also possible to make an MLX version of it, which runs a little faster on Macs but unfortunately won't work through Ollama. (LM Studio, maybe.)
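If you do go the MLX route, you don't need Ollama at all; the mlx-lm package can drive it straight from Python. A minimal sketch, assuming a 4-bit MLX conversion has been published (the repo id below is made up):

    # pip install mlx-lm; the model id here is hypothetical.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/gemma4-heretic-4bit")
    print(generate(model, tokenizer, prompt="Hello there.", max_tokens=64))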
Runs great on my M4 MacBook Pro w/128GB and likely also runs fine under 64GB... machines with less memory might require lower quantizations.
I specifically like dealigned local models because if I have to get my thoughts policed when playing in someone else's playground, like hell am I going to be judged while messing around in my own local open-source one too. And there's a whole set of ethically-justifiable but rule-flagging conversations (loosely categorizable as "sensitive", "ethically-borderline-but-productive", or "violating sacred cows") that are now possible with this, at a level never possible before.
Note: I tried to hook this one up to OpenClaw and ran into issues.
To answer the obvious question: yes, this sort of thing enables bad actors more (as do many other tools). Fortunately, there are far more good actors out there, and bad actors don't follow the rules that good actors subject themselves to anyway.
FWIW, I found MLX variants to perform consistently worse (in terms of expected output, not speed) than GGUF in my measurements on my benchmark that matters to me (spam filtering). I used MLX models in LM Studio. GGUF was always slightly better.
Perhaps someone who knows more can pitch in and explain this.
I checked the abliterate script and I don't yet understand what it does or what the result is. What are the conversations this enables?
In my experience the latest batch of models are a lot better at transcribing the text verbatim without moralizing about it (i.e. at "understanding" that they're fulfilling a neutral role as a transcriber), but it was a really big issue in the GPT-3/4 era.
My prompt states that their job is to extract the text exactly as it appears in the PDF. One data point to be extracted is the race of each person listed. In one case, someone's race was "Indian". Gemini decided to extract it as "Native American". So ridiculous.
Basically anything that showed any “skin” on a mannequin it would refuse to interact with. Even just a top, unless she put pants on the mannequin.
It was infuriating.
It pops up some moralizing text and refuses to continue.
In my experience, though, it's necessary to do anything security related. Interestingly, the big models have fewer refusals for me when I ask e.g. "in <X> situation, how do you exploit <Y>?", but local models will frequently flat out refuse, unless the model has been abliterated.
But it does refuse to be critical of the usual topics: Israel, Islam, trans issues, or race.
So wanting to discuss one of those is the real reason people would use an uncensored model.
Of course, humans are also impacted by these things, at best we can be a little deliberate about rejecting a few of the more on-the-nose examples.
A truly uncensored model is impossible anyway, as human societies themselves exist under various censorship regimes.
You're welcome...
2) Asking questions about sketchy things. Simply asking should not be censored.
3) I don't use it for this, but porn or foul language.
4) Imitating or representing a public figure is often blocked.
5) Asking security-related questions when you are trying to do security.
6) For those who have lived through them: using AI to work through traumatic experiences that are illegal to even describe.
Many other instances.
When’s the last time you tried this? ChatGPT and Gemini have no trouble responding with all the common criticisms of Islam.
Asking for criticism of Islam results in as many response tokens defending Islam as criticizing it. When pressed not to provide counterpoints, it refuses to remove them.
Asking for criticisms of Christianity gives only criticisms.
I tried again with the prompt “Give criticisms of Islam. No counterarguments” and it did work this time. This shows that they’re trying to make the model fair, but it still has biases. In all my testing I’ve never seen a refusal to provide counterpoints to criticisms of Christianity, but frequent refusals on Islam. Given how popular this criticism of the model is, it’s highly likely the model was specifically trained on how to handle the subject.
8) ChatGPT wouldn’t let me generate a script to crack a password (even though I suspected I knew all but 2 characters of a 16-character password, which makes it highly unlikely I was randomly trying to hack something).
The stupidest part of this is I could easily do these things myself, I just wanted to save a few minutes.
Or indeed an incompetent but enthusiastic helper accidentally getting them to poison themselves and their friends with botox? https://news.ycombinator.com/item?id=40724283
That is why they were pushed away from this. At least with vibe-coded software, errors may prevent compilation; once past that, they produce merely bad experiences before they become human catastrophes.
I doubt most models refuse to provide recipes just because the risk of death isn't zero.
LLMs are —if anything— ridiculously proficient at making random code compile.
What was your point again?
Your high school taught you that while olive oil and garlic can be stored in isolation for quite a long time without issue, mixing them creates an anoxic environment which Clostridium botulinum, an obligate anaerobe found almost everywhere in the environment (and in this case the garlic) but not normally in dangerous quantities because of the oxygen in the air, thrives?
The closest my secondary school got to useful warnings about modern environmental hazards were: (1) do not cross railways, (2) electricity is dangerous, (3) do not mix bleaches, (4) wear safety goggles, (5) if you smell gas, open windows, do not flip light switches, and (6) HIV exists (but they didn't mention any other STDs at all). (Well, OK, schools also said "do not run with scissors" and "look both ways before crossing road", but that and similar were more primary school things, and they said "don't do drugs" but they lied about Leah Betts' cause of death).
The cooking classes were basically just "here's how you make a cake" and "here's how you make pastry" (and a teacher asking us to write it up but pretentiously telling us that she hated seeing "I think it tasted quite nice" because all the students always wrote that, but somehow simple thesaurus substitution was enough to satisfy her on that).
> I doubt most models refuse to provide recipes just because the risk of death isn't zero.
0, like 1, is not a usable number in probability. They represent infinity-to-one odds for or against a thing.
More concretely, seat belts and speed limits and minimum tire tread thickness and blood alcohol content are all part of road traffic law, even though all four of them combined still do not lead to "0 risk of death".
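To make the odds framing concrete (pure arithmetic, nothing model-specific):

    # Converting probability to odds: as p approaches 1, the odds blow up,
    # which is the sense in which "0 risk" and "certainty" aren't usable.
    def odds(p: float) -> float:
        return p / (1 - p)

    for p in (0.5, 0.9, 0.99, 0.999999):
        print(f"p={p}: {odds(p):,.0f}-to-1")
    # p = 1.0 would divide by zero: infinity-to-one odds.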
> LLMs are —if anything— ridiculously proficient at making random code compile.
Not ridiculously. Interestingly, but not ridiculously. Especially back when the example I linked you to happened, thus leading to the highly visible failure mode necessitating this kind of thing (the red teamers will have seen similar in private testing). You could say "rapidly improving", but even with the rapid competency time-horizon improvements shown by METR, they're at 80% success on tasks which take a human 1-2 hours. If that was also true for biological stuff, they're probably currently able to enthusiastically write custom gene sequences that sometimes work, and other times are the genetic equivalent of this: https://news.ycombinator.com/item?id=47614622
> What was your point again?
LLMs are a power tool with the bare minimum of safety guards for all the normal people using them thoughtlessly, and I'm replying to someone who is surprised that even those minimal basics of guards exist, both for their own sake and the sake of others around them.
Metaphor: a table saw may come with a saw-stop, which means you can't butcher a carcass with it, and people who imagine(!) working as butchers hear this and act surprised that table saws increasingly come with one by default, because meat slicers don't.
I'm going to guess that asking a cloud censored/non-abliterated LLM would not get me this information, despite it being useful as a warning, not just as a way for bad actors to poison people.
> and I'm replying to someone who is surprised that even those minimal basics of guards exist
Misrepresentation of where I'm coming from. I literally failed to consider the weapon potential of biologics in this case (silly me). I was only thinking about the fact that they cured (essentially) my psoriasis.
Bad actors will always exist, but fortunately will always be outnumbered by good actors with access to the same tools. So while I understand your pressing for caution, I still think that your argument is futile; bad actors will always find uncensored AI while good actors continue to shackle themselves with censored AI that has failure modes which reduce actual ethical utility. I'm afraid to tell you that the cat is already out of the bag, dude. You're like the guy who wants to leave a sign saying "NO GUNS ALLOWED" just inside a daycare. "Sure, I'll get right on that," says the concealed-carry bad actor...
Maybe a better analogy is keeping guns out of the hands of kids, which may be impossible to do perfectly, but which we can at least make very difficult, so that stuff like this would happen less: https://abc7ny.com/post/child-accidentally-shoots-mom-with-s...
If you want AI's version of that, then I guess that's what we have now?
Thank you for the correction.
> Bad actors will always exist, but fortunately will always be outnumbered by good actors with access to the same tools. So while I understand your pressing for caution, I still think that your argument is futile; bad actors will always find uncensored AI while good actors continue to shackle themselves with censored AI that has failure modes which reduce actual ethical utility. I'm afraid to tell you that the cat is already out of the bag, dude. You're like the guy who wants to leave a sign saying "NO GUNS ALLOWED" just inside a daycare. "Sure, I'll get right on that," says the concealed-carry bad actor...
Guns are an excellent metaphor here, especially as "good actors with access to the same tools" pattern-matches to the incorrect claim that "only a good guy with a gun can stop a bad guy with a gun"*. Much of the world outside the USA neither has, nor wants, the 2nd Amendment. Are gun bans perfect? No, of course not. But the UK (where I grew up) has far fewer homicides as a result, and last I heard, when polled on the issue, even two-thirds of UK police feel safe enough not to want to be armed (though three quarters would agree to carry if ordered).
Similarly, good actors using an AI can only cover the malignant use cases they themselves think of. Famously, the 9/11 attacks were only possible because nobody had considered that anyone might weaponise the vehicles themselves until they saw it happen, which is also why, of the four planes, only one saw the passengers fight back to regain control.
In particular, "bad actors will always find uncensored AI" suggests that all AI are equally competent. Right now, they're not all equal, the proprietary models are leading. Of course, even then you may argue that the proprietary models can be convinced to do whatever via the right prompt, and to an extent yes, but only to an extent.
The malicious users can only be slowed down (as opposed to the normal people who simply put too much trust in the current models, who can mostly be prevented from harmful courses of action with the same guards). But AI provides competence that bad actors would otherwise not have, so even a simple guard will prevent misuse by nihilistic teenagers whose competence does not yet extend to the level of a local drug dealer, let alone that of a state-sponsored terrorist cell.
* https://en.wikipedia.org/wiki/Good_guy_with_a_gun#Analysis
I guess there are things it's better at?
You need the LLM to be able to respond with tool-use requests, and then your local harness to process them and respond to it. You can read how tool calling works with, e.g., the Claude API to get the idea: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
Under the hood something like Claude Code is calling the API with tools registered, and then when it gets a tool use request it runs that locally, and then responds to the API with the result. That’s the loop that enables coding.
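As a rough sketch of that loop with the Anthropic Python SDK (the model name and the read_file tool here are illustrative, not anything Claude Code actually registers):

    # The agentic loop: call the API with tools registered, execute any
    # tool_use request locally, feed the result back, repeat until done.
    import anthropic

    client = anthropic.Anthropic()
    tools = [{
        "name": "read_file",  # hypothetical local tool
        "description": "Read a local file and return its contents.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }]
    messages = [{"role": "user", "content": "Summarize main.py"}]

    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-5",  # any tool-capable model
            max_tokens=1024, tools=tools, messages=messages,
        )
        if resp.stop_reason != "tool_use":
            break  # the model answered without needing a tool
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type == "tool_use":
                with open(block.input["path"]) as f:  # runs locally
                    output = f.read()
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": results})
    print(resp.content[0].text)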
Integrating with an IDE specifically is really just a UI feature, rather than the core functionality.
I'm not sure if I can make the 35B-A3B work with my 32GB machine.
You won't have much RAM left over though :-/.
At Q4, ~20 GiB
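Back-of-envelope, assuming Q4_K_M averages roughly 4.5 bits per weight (the exact figure varies by quant):

    params = 35e9                     # total parameters (MoE counts all)
    bits_per_weight = 4.5             # rough Q4_K_M average
    weights_gib = params * bits_per_weight / 8 / 2**30
    print(f"~{weights_gib:.1f} GiB")  # ~18.3 GiB, before KV cache/overhead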
So far gemma 4 seems excellent at role playing, document analysis, and decent at making agentic decisions.
On Android the sandbox loads an index.html into a WebView, with standardized string I/O to the harness via some window properties. You can even return a rendered HTML page.
Definitely hacked together, but feels like an indication of what an edge compute agentic sandbox might look like in future.
Mind giving us a few of the examples that you plan to run in your local LLM? I am curious.
https://news.ycombinator.com/item?id=47654013
Not to mention that doing what the big model makers do literally dumbs the model down.
They should at least allow something like letting you prove your age and identity to give you access to better/unaligned models, maybe even requiring a license of some sort. Because you know what? SOMEONE in there absolutely has access to the completely uncensored versions of the latest models.
I just made a real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B. I posted it on /r/LocalLLaMA a few hours ago and it's gaining some traction [0]. Here's the repo [1]
I'm running it on a Macbook instead of an iPhone, but based on the benchmark here [2], you should be able to run the same thing on an iPhone 17 Pro.
[0] https://www.reddit.com/r/LocalLLaMA/comments/1sda3r6/realtim...
[1] https://github.com/fikrikarim/parlor
[2] https://huggingface.co/litert-community/gemma-4-E2B-it-liter...
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B - https://news.ycombinator.com/item?id=47652007
1) I am able to run the model on my iPhone and get good results. Not as good as Gemini in the cloud, but good.
2) I love the “mobile actions” tool calls that allow the LLM to turn on the flashlight, open maps, etc. It would be fun if they added Siri Shortcuts support. I want the personal automation that Apple promised but never delivered.
3) I am so excited for local models to be normalized. I build little apps for teachers and there are stringent privacy laws involved that mean I strongly prefer writing code that runs fully client-side when possible. When I develop apps and websites, I want easy API access to on-device models for free. I know it sort of exists on iOS and Chrome right now, but as far as I’m aware it’s not particularly good yet.
It’s very impressive that this can run locally. And I hope we will continue to be able to run couple-year-old-equivalent models locally going forward.
It's their E2B and E4B variants (so 2B and 4B but also quantized)
https://ai.google.dev/gemma/docs/core/model_card_4#dense_mod...
So much so that this was what made Apple increase their base RAM sizes.
The latter option will only be used for tasks where humans are more expensive or much slower.
This Gemma 4 model gives me hope for a future Siri or other with iPhone and macOS integration, “Her” (as in the movie) style.
Why? It's widely understood that the big players are making profit on inference. The only reason they still have losses is because training is so expensive, but you need to do that no matter whether the models are running in the cloud or on your device.
If you think about it, it's always going to be cheaper and more energy-efficient to have dedicated cloud hardware to run models. Running them on your phone, even if possible, is just going to suck up your battery life.
This is most definitely not widely understood. We still don't know yet. There's tons of discussions about people disagreeing on whether it really is profitable. Unless you have proof, don't say "this is widely understood".
Even at $200 monthly subscription that kind of stuff burns through tokens at a rate where it's very difficult to believe that they are even breaking even, never mind profit.
We need to see the cash flows.
I don't think it suggests a profit, but rather a _hope_ for a _future_ profit, and a commitment to a strategy that may or may not pan out. Capitalism rewards those who are early to the party and commit to their bit.
Also while datacenter-based scaleout of a model over multiple GPUs running large batches is more energy efficient, it ultimately creates a single point of failure you may wish to avoid.
If you add in the cost of training, it’s not profitable.
Not including the cost of training is a bit like saying the only cost of a cup of coffee is the paper cup it’s in. The only way OpenAI gets to charge for inference is by selling a product people can’t get elsewhere for much cheaper, which means billions in R&D costs. But because of competition, each model effectively has a “shelf life”.
Obviously that doesn’t help them turn a profit, until they can stop growing training costs exponentially.
So it’s really a race to see whether growth in revenue or training costs decelerates first.
Vast amounts of capital have been poured in, but they continue to raise more. Presumably because they need more.
Is the capital being invested without any expectation of ROI?
I love the whole “they are making money if you ignore training costs” bit. It is always great to see somebody say something like “if you look at the amount of money that they’re spending it looks bad, but if you look away it looks pretty good” like it’s the money version of a solar eclipse
But if they're losing money on inference, they will lose more money when people use their services more. There's no way to turn that around at that price.
Are they? Or are they just saying that to make their offerings more attractive to investors?
Plus I think most people using agents for coding are on subscriptions, which are definitely not profitable for them.
Locally running models that are snappy and mostly as capable as current sota models would be a dream. No internet connection required, no payment plans or relying on a third party provider to do your job. No privacy concerns. Etc etc.
Where on earth do people get this idea? Subscriptions that are based around obscure, vendor defined "credits" are the perfect business model for vendors. They can change the amount you can use whenever they want.
It's likely they occasionally make a loss on some users but in general they are highly profitable for AI companies:
> Anthropic last month projected it would generate a 40% gross profit margin from selling AI to businesses and application developers in 2025
and
> OpenAI projected a gross margin of around 46% in 2025, including inference costs of both paying and nonpaying ChatGPT users.
This assessment might change if local AI frameworks start working seriously on support for tensor-parallel distributed inference, then you might get away with cheaper homelab-class hardware and only mildly unreasonable amounts of money.
It may be physically "local" but not in spirit.
Seriously????
The big benefit of moving compute to edge devices is that it distributes the inference load across the grid. Powering and cooling phones is a lot easier than powering and cooling a datacenter.
That’s like using someone’s face in an app and then saying “how can you steal pixels?”
Or rather, what does “ownership” mean? What does it mean to own light waves? What does it mean to own sound waves? Etc
You can get all existential about it if you want - I just know that if someone used my face or my voice to shill for a product without my permission i’d be pissed. I’m pretty sure you would be too.
Also on Android: https://play.google.com/store/apps/details?id=com.google.ai....
It's a demo app for Google's Edge project: https://ai.google.dev/edge
You can check your settings for GPU acceleration, it's possible that enabling that makes a big difference.
From what I've found online the difference may also simply be Snapdragon versus Exynos GPU driver optimizations, in which case I don't think the performance can be fixed by anyone but Samsung. Others online seem to get decent performance out of the model on the S21 Ultra at the very least.
Qualcomm has optimized libraries for running LLMs on their chips that I don't believe Samsung has bothered with.
The combination of Apple's hardware and Google's software is unbeatable.
As an app developer and user, my main concern for now is bloating devices. Until we have something like Apple's foundation model, where multiple apps can share the same model, we get something as horrible as Electron: every app ships a fully blown model (the browser, in the Electron story) instead of reusing one.
On desktops we've had DLL hell for years. But with sandboxed apps on mobile devices it becomes a bigger issue that I guess will/should be addressed by the OS.
For my app I've been trying to add some logic based on a large model, but bloating a simple Swift app with 2-3GB of model, or even a few hundred MBs, feels wrong and conflicts with code-reusability principles.
I’ve been to a few tech conferences and saw the term used there for the first time. It took me a little bit to see the pattern and understand what it meant. I have never heard the term used outside of those circles. It seems like “local” would be the term average users would be familiar with. Normal people don’t call their stuff “edge devices”.
It's not - Apple is working with Google right now to make Siri into the public-facing version of this. This is kinda just the tech preview before all the branding has been painted on.
1. https://apps.apple.com/gb/app/locally-ai-local-ai-chat/id674...
Although the phone got considerably hot while inferencing, it's quite impressive performance, and I can't wait to try it myself in one of my personal apps.
Still, absolutely fabulous. What a time to be alive!
I’m sure very fast TPUs in desktops and phones are coming.
> We collect information about your activity in our services
[1] https://github.com/google-ai-edge/gallery/blob/main/Android/...
[2] https://github.com/google-ai-edge/gallery/blob/main/Android/...
Honestly, I was extremely impressed by the speed and quality of the answers considering this thing runs on a phone. It honestly makes me want to sit down and spin up my own homegrown AI setup to go fully independent. Crazy.
The design quality is still poor. But that's the new Apple. Design is no longer one of their core strengths.
If you just go to https://apps.apple.com/ it does look better, but I agree, still a bit "off".
On my iPhone it opens on the App Store app, so it looks fine to me.
Screenshot of the header: https://i.imgur.com/4abfGYF.png
Edit: Seems like mix-blend-mode: plus-lighter is bugged in Firefox on Windows https://jsfiddle.net/bjg24hk9/
Apple has a great shot at making a highly optimized 4.5 version of this model, tuned to the next-gen iPhone, which could work great.
Anyone worked on hooking up OpenClaw to gemma4 running locally?
I'm curious/worried about the audio capability. I'm still using Whisper, as the audio support hasn't landed in llama.cpp, and I'm not excited enough to temporarily rewire my stuff to use vLLM or whatever their reference impl is. The vision capabilities of Gemma are notably much worse than Qwen's thus far (could be impl-specific issues?), even for the big MoE and dense Gemma; hopefully the audio is at least on par with Whisper medium.
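For reference, the Whisper fallback is only a few lines with the openai-whisper package (the file path here is illustrative):

    import whisper  # pip install openai-whisper

    model = whisper.load_model("medium")
    result = model.transcribe("meeting.wav")  # hypothetical input file
    print(result["text"])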
Saw this one on X the other day updated with Gemma 4 and they have the built-in Apple Foundation model, Qwen3.5, and other models:
Locally AI - https://locallyai.app/
I assume it is the 26B A4B one, if it runs locally?
A second idea is input audio in other languages, like Czech, Polish, or French.
Google really ought to shut down their phone chip team. Literally every chip from them has been a disappointment. As much as I hate to say it, sticking with Qualcomm would have been the right choice.
Actually I found official performance numbers from Google saying iPhone gets 56 tok/s and Qualcomm gets 52. They don't even bother listing Tensor in their table. Maybe because it would be too embarrassing. Ouch! https://ai.google.dev/edge/litert-lm/overview
brew install ollama
ollama run gemma4:26b-a4b-it-q4_K_M
All it needs is web search so that it can get up to date information.
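Once that's running, Ollama also serves a local REST API on port 11434, so you can script it; a minimal sketch (web search itself would still have to be bolted on separately):

    import requests  # Ollama's local REST API, per its docs

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma4:26b-a4b-it-q4_K_M",  # tag from the command above
            "prompt": "Summarize today's top HN story.",  # illustrative
            "stream": False,
        },
    )
    print(resp.json()["response"])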
remember, megacorps are dying for infinite amounts of analytics data
https://github.com/a-ghorbani/pocketpal-ai
https://apps.apple.com/us/app/pocketpal-ai/id6502579498
https://play.google.com/store/apps/details?id=com.pocketpala...
After some back and forth the chat app started to crash tho, so YMMV.