I don't normally do the whole "abliterated" thing (dealignment), but after discovering https://github.com/p-e-w/heretic I was too tempted not to try it with this model a couple of days ago (I made a repo to make it easier, actually: https://github.com/pmarreck/gemma4-heretical) and... wow. It worked. And not having a built-in nanny is fun!
It's also possible to make an MLX version of it, which runs a little faster on Macs but unfortunately won't work through Ollama. (LM Studio, maybe.)
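If you do go the MLX route, you don't need Ollama at all; the mlx-lm package can drive it straight from Python. A minimal sketch, assuming a 4-bit MLX conversion has been published (the repo id below is made up):

    # pip install mlx-lm; the model id here is hypothetical.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/gemma4-heretic-4bit")
    print(generate(model, tokenizer, prompt="Hello there.", max_tokens=64))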
Runs great on my M4 MacBook Pro w/128GB and likely also runs fine under 64GB... machines with less memory might require lower quantizations.
I specifically like dealigned local models because if I have to get my thoughts policed when playing in someone else's playground, like hell am I going to be judged while messing around in my own local open-source one too. And there's a whole set of ethically-justifiable but rule-flagging conversations (loosely categorizable as "sensitive", "ethically-borderline-but-productive", or "violating sacred cows") that are now possible with this, at a level never possible before.
Note: I tried to hook this one up to OpenClaw and ran into issues.
To answer the obvious question: yes, this sort of thing enables bad actors more (as do many other tools). Fortunately, there are far more good actors out there, and bad actors don't follow the rules that good actors subject themselves to anyway.
FWIW, I found MLX variants to perform consistently worse (in terms of expected output, not speed) than GGUF in my measurements on my benchmark that matters to me (spam filtering). I used MLX models in LM Studio. GGUF was always slightly better.
Perhaps someone who knows more can pitch in and explain this.
I checked the abliterate script and I don't yet understand what it does or what the result is. What are the conversations this enables?
In my experience the latest batch of models are a lot better at transcribing the text verbatim without moralizing about it (i.e. at "understanding" that they're fulfilling a neutral role as a transcriber), but it was a really big issue in the GPT-3/4 era.
My prompt states that their job is to extract the text exactly as it appears in the PDF. One data point to be extracted is the race of each person listed. In one case, someone's race was "Indian". Gemini decided to extract it as "Native American". So ridiculous.
Basically anything that showed any “skin” on a mannequin it would refuse to interact with. Even just a top, unless she put pants on the mannequin.
It was infuriating.
It pops up some moralizing text and refuses to continue.
In my experience, though, it's necessary to do anything security related. Interestingly, the big models have fewer refusals for me when I ask e.g. "in <X> situation, how do you exploit <Y>?", but local models will frequently flat out refuse, unless the model has been abliterated.
But it does refuse to be critical of the usual topics: Israel, Islam, trans issues, or race.
So wanting to discuss one of those is the real reason people would use an uncensored model.
Of course, humans are also impacted by these things, at best we can be a little deliberate about rejecting a few of the more on-the-nose examples.
A truly uncensored model is impossible anyway, as human societies themselves exist under various censorship regimes.
You're welcome...
2) Asking questions about sketchy things. Simply asking should not be censored.
3) I don't use it for this, but porn or foul language.
4) Imitating or representing a public figure is often blocked.
5) Asking security-related questions when you are trying to do security.
6) For those who have lived through them: using AI to work through traumatic experiences that are illegal to even describe.
Many other instances.
When’s the last time you tried this? ChatGPT and Gemini have no trouble responding with all the common criticisms of Islam.
Asking for criticism of Islam results in as many response tokens defending Islam as criticizing it. When pressed not to provide counterpoints, it refuses to remove them.
Asking for criticisms of Christianity gives only criticisms.
I tried again with the prompt “Give criticisms of Islam. No counterarguments” and it did work this time. This shows that they’re trying to make the model fair, but it still has biases. In all my testing I’ve never seen a refusal to provide counterpoints to criticisms of Christianity, but frequent refusals on Islam. Given how popular this criticism of the model is, it’s highly likely the model was specifically trained on how to handle the subject.
8) ChatGPT wouldn’t let me generate a script to crack a password (even though I suspected I knew all but 2 characters of a 16-character password, which makes it highly unlikely I was randomly trying to hack something).
The stupidest part of this is I could easily do these things myself, I just wanted to save a few minutes.
Or indeed an incompetent but enthusiastic helper accidentally getting them to poison themselves and their friends with botox? https://news.ycombinator.com/item?id=40724283
That is why they were pushed away from this. At least with vibe-coded software, errors may prevent compilation; once past that, they produce merely bad experiences before they become human catastrophes.
I doubt most models refuse to provide recipes just because the risk of death isn't zero.
LLMs are —if anything— ridiculously proficient at making random code compile.
What was your point again?
Your high school taught you that while olive oil and garlic can be stored in isolation for quite a long time without issue, mixing them creates an anoxic environment which Clostridium botulinum, an obligate anaerobe found almost everywhere in the environment (and in this case the garlic) but not normally in dangerous quantities because of the oxygen in the air, thrives?
The closest my secondary school got to useful warnings about modern environmental hazards were: (1) do not cross railways, (2) electricity is dangerous, (3) do not mix bleaches, (4) wear safety goggles, (5) if you smell gas, open windows, do not flip light switches, and (6) HIV exists (but they didn't mention any other STDs at all). (Well, OK, schools also said "do not run with scissors" and "look both ways before crossing road", but that and similar were more primary school things, and they said "don't do drugs" but they lied about Leah Betts' cause of death).
The cooking classes were basically just "here's how you make a cake" and "here's how you make pastry" (and a teacher asking us to write it up but pretentiously telling us that she hated seeing "I think it tasted quite nice" because all the students always wrote that, but somehow simple thesaurus substitution was enough to satisfy her on that).
> I doubt most models refuse to provide recipes just because the risk of death isn't zero.
0, like 1, is not a usable number in probability. They represent infinity-to-one odds for or against a thing.
More concretely, seat belts and speed limits and minimum tire tread thickness and blood alcohol content are all part of road traffic law, even though all four of them combined still do not lead to "0 risk of death".
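To make the odds framing concrete (pure arithmetic, nothing model-specific):

    # Converting probability to odds: as p approaches 1, the odds blow up,
    # which is the sense in which "0 risk" and "certainty" aren't usable.
    def odds(p: float) -> float:
        return p / (1 - p)

    for p in (0.5, 0.9, 0.99, 0.999999):
        print(f"p={p}: {odds(p):,.0f}-to-1")
    # p = 1.0 would divide by zero: infinity-to-one odds.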
> LLMs are —if anything— ridiculously proficient at making random code compile.
Not ridiculously. Interestingly, but not ridiculously. Especially back when the example I linked you to happened, thus leading to the highly visible failure mode necessitating this kind of thing (the red teamers will have seen similar in private testing). You could say "rapidly improving", but even with the rapid competency time-horizon improvements shown by METR, they're at 80% success on tasks which take a human 1-2 hours. If that was also true for biological stuff, they're probably currently able to enthusiastically write custom gene sequences that sometimes work, and other times are the genetic equivalent of this: https://news.ycombinator.com/item?id=47614622
> What was your point again?
LLMs are a power tool with the bare minimum of safety guards for all the normal people using them thoughtlessly, and I'm replying to someone who is surprised that even those minimal basics of guards exist, both for their own sake and the sake of others around them.
Metaphor: a table saw may come with a saw-stop, which means you can't butcher a carcass with it, and people who imagine(!) working as butchers hear this and act surprised that table saws increasingly come with one by default, because meat slicers don't.
I'm going to guess that asking a cloud censored/non-abliterated LLM would not get me this information, despite it being useful as a warning, not just as a way for bad actors to poison people.
> and I'm replying to someone who is surprised that even those minimal basics of guards exist
Misrepresentation of where I'm coming from. I literally failed to consider the weapon potential of biologics in this case (silly me). I was only thinking about the fact that they cured (essentially) my psoriasis.
Bad actors will always exist, but fortunately will always be outnumbered by good actors with access to the same tools. So while I understand your pressing for caution, I still think that your argument is futile; bad actors will always find uncensored AI while good actors continue to shackle themselves with censored AI that has failure modes which reduce actual ethical utility. I'm afraid to tell you that the cat is already out of the bag, dude. You're like the guy who wants to leave a sign saying "NO GUNS ALLOWED" just inside a daycare. "Sure, I'll get right on that," says the concealed-carry bad actor...
Maybe a better analogy is keeping guns out of the hands of kids, which may be impossible to do perfectly, but which we can at least make very difficult, so that stuff like this would happen less: https://abc7ny.com/post/child-accidentally-shoots-mom-with-s...
If you want AI's version of that, then I guess that's what we have now?
Thank you for the correction.
> Bad actors will always exist, but fortunately will always be outnumbered by good actors with access to the same tools. So while I understand your pressing for caution, I still think that your argument is futile; bad actors will always find uncensored AI while good actors continue to shackle themselves with censored AI that has failure modes which reduce actual ethical utility. I'm afraid to tell you that the cat is already out of the bag, dude. You're like the guy who wants to leave a sign saying "NO GUNS ALLOWED" just inside a daycare. "Sure, I'll get right on that," says the concealed-carry bad actor...
Guns are an excellent metaphor here, especially as "good actors with access to the same tools" pattern-matches to the incorrect claim that "only a good guy with a gun can stop a bad guy with a gun"*. Much of the world outside the USA neither has, nor wants, the 2nd Amendment. Are gun bans perfect? No, of course not. But the UK (where I grew up) has far fewer homicides as a result, and last I heard, when polled on the issue, even two-thirds of UK police feel safe enough not to want to be armed (though three quarters would agree to carry if ordered).
Similarly, good actors using an AI can only cover the malignant use cases they themselves think of. Famously, the 9/11 attacks were only possible because nobody had considered that anyone might weaponise the vehicles themselves until they saw it happen, which is also why, of the four planes, only one saw the passengers fight back to regain control.
In particular, "bad actors will always find uncensored AI" suggests that all AI are equally competent. Right now, they're not all equal, the proprietary models are leading. Of course, even then you may argue that the proprietary models can be convinced to do whatever via the right prompt, and to an extent yes, but only to an extent.
The malicious users can only be slowed down (as opposed to the normal people who simply put too much trust in the current models, who can mostly be prevented from harmful courses of action with the same guards). But AI provides competence that bad actors would otherwise not have, so even a simple guard will prevent misuse by nihilistic teenagers whose competence does not yet extend to the level of a local drug dealer, let alone that of a state-sponsored terrorist cell.
* https://en.wikipedia.org/wiki/Good_guy_with_a_gun#Analysis
I guess there are things it's better at?
You need the LLM to be able to respond with tool-use requests, and then your local harness to process them and respond to it. You can read how tool calling works with, e.g., the Claude API to get the idea: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
Under the hood something like Claude Code is calling the API with tools registered, and then when it gets a tool use request it runs that locally, and then responds to the API with the result. That’s the loop that enables coding.
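As a rough sketch of that loop with the Anthropic Python SDK (the model name and the read_file tool here are illustrative, not anything Claude Code actually registers):

    # The agentic loop: call the API with tools registered, execute any
    # tool_use request locally, feed the result back, repeat until done.
    import anthropic

    client = anthropic.Anthropic()
    tools = [{
        "name": "read_file",  # hypothetical local tool
        "description": "Read a local file and return its contents.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }]
    messages = [{"role": "user", "content": "Summarize main.py"}]

    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-5",  # any tool-capable model
            max_tokens=1024, tools=tools, messages=messages,
        )
        if resp.stop_reason != "tool_use":
            break  # the model answered without needing a tool
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type == "tool_use":
                with open(block.input["path"]) as f:  # runs locally
                    output = f.read()
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": results})
    print(resp.content[0].text)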
Integrating with an IDE specifically is really just a UI feature, rather than the core functionality.
I'm not sure if I can make the 35B-A3B work with my 32GB machine.
You won't have much RAM left over though :-/.
At Q4, ~20 GiB
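Back-of-envelope, assuming Q4_K_M averages roughly 4.5 bits per weight (the exact figure varies by quant):

    params = 35e9                     # total parameters (MoE counts all)
    bits_per_weight = 4.5             # rough Q4_K_M average
    weights_gib = params * bits_per_weight / 8 / 2**30
    print(f"~{weights_gib:.1f} GiB")  # ~18.3 GiB, before KV cache/overhead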
So far gemma 4 seems excellent at role playing, document analysis, and decent at making agentic decisions.
On Android the sandbox loads an index.html into a WebView, with standardized string I/O to the harness via some window properties. You can even return a rendered HTML page.
Definitely hacked together, but feels like an indication of what an edge compute agentic sandbox might look like in future.
Mind giving us a few of the examples that you plan to run in your local LLM? I am curious.
https://news.ycombinator.com/item?id=47654013
Not to mention that doing what the big model makers do literally dumbs the model down.
They should at least allow something like letting you prove your age and identity to give you access to better/unaligned models, maybe even requiring a license of some sort. Because you know what? SOMEONE in there absolutely has access to the completely uncensored versions of the latest models.
I just made a real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B. I posted it on /r/LocalLLaMA a few hours ago and it's gaining some traction [0]. Here's the repo [1]
I'm running it on a Macbook instead of an iPhone, but based on the benchmark here [2], you should be able to run the same thing on an iPhone 17 Pro.
[0] https://www.reddit.com/r/LocalLLaMA/comments/1sda3r6/realtim...
[1] https://github.com/fikrikarim/parlor
[2] https://huggingface.co/litert-community/gemma-4-E2B-it-liter...
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B - https://news.ycombinator.com/item?id=47652007
1) I am able to run the model on my iPhone and get good results. Not as good as Gemini in the cloud, but good.
2) I love the “mobile actions” tool calls that allow the LLM to turn on the flashlight, open maps, etc. It would be fun if they added Siri Shortcuts support. I want the personal automation that Apple promised but never delivered.
3) I am so excited for local models to be normalized. I build little apps for teachers and there are stringent privacy laws involved that mean I strongly prefer writing code that runs fully client-side when possible. When I develop apps and websites, I want easy API access to on-device models for free. I know it sort of exists on iOS and Chrome right now, but as far as I’m aware it’s not particularly good yet.
It’s very impressive that this can run locally. And I hope we will continue to be able to run couple-year-old-equivalent models locally going forward.
It's their E2B and E4B variants (so 2B and 4B but also quantized)
https://ai.google.dev/gemma/docs/core/model_card_4#dense_mod...
So much so that this was what made Apple increase their base RAM sizes.
The latter option will only be used for tasks where humans are more expensive or much slower.
This Gemma 4 model gives me hope for a future Siri or other with iPhone and macOS integration, “Her” (as in the movie) style.
Why? It's widely understood that the big players are making profit on inference. The only reason they still have losses is because training is so expensive, but you need to do that no matter whether the models are running in the cloud or on your device.
If you think about it, it's always going to be cheaper and more energy-efficient to have dedicated cloud hardware to run models. Running them on your phone, even if possible, is just going to suck up your battery life.
This is most definitely not widely understood. We still don't know yet. There's tons of discussions about people disagreeing on whether it really is profitable. Unless you have proof, don't say "this is widely understood".
Even at $200 monthly subscription that kind of stuff burns through tokens at a rate where it's very difficult to believe that they are even breaking even, never mind profit.
We need to see the cash flows.
I don't think it suggests a profit, but rather a _hope_ for a _future_ profit, and a commitment to a strategy that may or may not pan out. Capitalism rewards those who are early to the party and commit to their bit.
Also while datacenter-based scaleout of a model over multiple GPUs running large batches is more energy efficient, it ultimately creates a single point of failure you may wish to avoid.
If you add in the cost of training, it’s not profitable.
Not including the cost of training is a bit like saying the only cost of a cup of coffee is the paper cup it’s in. The only way OpenAI gets to charge for inference is by selling a product people can’t get elsewhere for much cheaper, which means billions in R&D costs. But because of competition, each model effectively has a “shelf life”.
Obviously that doesn’t help them turn a profit, until they can stop growing training costs exponentially.
So it’s really a race to see whether growth in revenue or training costs decelerates first.
Vast amounts of capital have been poured in, but they continue to raise more. Presumably because they need more.
Is the capital being invested without any expectation of ROI?
I love the whole “they are making money if you ignore training costs” bit. It is always great to see somebody say something like “if you look at the amount of money that they’re spending it looks bad, but if you look away it looks pretty good” like it’s the money version of a solar eclipse
But if they're losing money on inference, they will lose more money when people use their services more. There's no way to turn that around at that price.
Are they? Or are they just saying that to make their offerings more attractive to investors?
Plus I think most people using agents for coding are on subscriptions, which are definitely not profitable for them.
Locally running models that are snappy and mostly as capable as current sota models would be a dream. No internet connection required, no payment plans or relying on a third party provider to do your job. No privacy concerns. Etc etc.
Where on earth do people get this idea? Subscriptions that are based around obscure, vendor defined "credits" are the perfect business model for vendors. They can change the amount you can use whenever they want.
It's likely they occasionally make a loss on some users but in general they are highly profitable for AI companies:
> Anthropic last month projected it would generate a 40% gross profit margin from selling AI to businesses and application developers in 2025
and
> OpenAI projected a gross margin of around 46% in 2025, including inference costs of both paying and nonpaying ChatGPT users.
This assessment might change if local AI frameworks start working seriously on support for tensor-parallel distributed inference, then you might get away with cheaper homelab-class hardware and only mildly unreasonable amounts of money.
It may be physically "local" but not in spirit.
Seriously????
The big benefit of moving compute to edge devices is that it distributes the inference load across the grid. Powering and cooling phones is a lot easier than powering and cooling a datacenter.
That’s like using someone’s face in an app and then saying “how can you steal pixels?”
Or rather, what does “ownership” mean? What does it mean to own light waves? What does it mean to own sound waves? Etc
You can get all existential about it if you want - I just know that if someone used my face or my voice to shill for a product without my permission i’d be pissed. I’m pretty sure you would be too.
Also on Android: https://play.google.com/store/apps/details?id=com.google.ai....
It's a demo app for Google's Edge project: https://ai.google.dev/edge
You can check your settings for GPU acceleration, it's possible that enabling that makes a big difference.
From what I've found online the difference may also simply be Snapdragon versus Exynos GPU driver optimizations, in which case I don't think the performance can be fixed by anyone but Samsung. Others online seem to get decent performance out of the model on the S21 Ultra at the very least.
Qualcomm has optimized libraries for running LLMs on their chips that I don't believe Samsung has bothered with.
The combination of Apple's hardware and Google's software is unbeatable.
As an app developer and user, my main concern for now is bloating devices. Until we have something like Apple's foundation model, where multiple apps can share the same model, we get something as horrible as Electron: every app ships a fully blown model (the browser, in the Electron story) instead of reusing one.
On desktops we've had DLL hell for years. But with sandboxed apps on mobile devices it becomes a bigger issue that I guess will/should be addressed by the OS.
For my app I've been trying to add some logic based on a large model, but bloating a simple Swift app with 2-3GB of model, or even a few hundred MBs, feels wrong and conflicts with code-reusability principles.
I’ve been to a few tech conferences and saw the term used there for the first time. It took me a little bit to see the pattern and understand what it meant. I have never heard the term used outside of those circles. It seems like “local” would be the term average users would be familiar with. Normal people don’t call their stuff “edge devices”.
It's not - Apple is working with Google right now to make Siri into the public-facing version of this. This is kinda just the tech preview before all the branding has been painted on.
1. https://apps.apple.com/gb/app/locally-ai-local-ai-chat/id674...
Although the phone got considerably hot while inferencing, it's quite impressive performance, and I can't wait to try it myself in one of my personal apps.
Still, absolutely fabulous. What a time to be alive!
I’m sure very fast TPUs in desktops and phones are coming.
> We collect information about your activity in our services
[1] https://github.com/google-ai-edge/gallery/blob/main/Android/...
[2] https://github.com/google-ai-edge/gallery/blob/main/Android/...
Honestly, I was extremely impressed by the speed and quality of the answers considering this thing runs on a phone. It honestly makes me want to sit down and spin up my own homegrown AI setup to go fully independent. Crazy.
The design quality is still poor. But that's the new Apple. Design is no longer one of their core strengths.
If you just go to https://apps.apple.com/ it does look better, but I agree, still a bit "off".
On my iPhone it opens on the App Store app, so it looks fine to me.
Screenshot of the header: https://i.imgur.com/4abfGYF.png
Edit: Seems like mix-blend-mode: plus-lighter is bugged in Firefox on Windows https://jsfiddle.net/bjg24hk9/
Apple has a great shot at making a highly optimized 4.5 version of this model, tuned to the next-gen iPhone, which could work great.
Anyone worked on hooking up OpenClaw to gemma4 running locally?
I'm curious/worried about the audio capability. I'm still using Whisper, as the audio support hasn't landed in llama.cpp, and I'm not excited enough to temporarily rewire my stuff to use vLLM or whatever their reference impl is. The vision capabilities of Gemma are notably much worse than Qwen's thus far (could be impl-specific issues?), even for the big MoE and dense Gemma; hopefully the audio is at least on par with Whisper medium.
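For reference, the Whisper fallback is only a few lines with the openai-whisper package (the file path here is illustrative):

    import whisper  # pip install openai-whisper

    model = whisper.load_model("medium")
    result = model.transcribe("meeting.wav")  # hypothetical input file
    print(result["text"])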
Saw this one on X the other day updated with Gemma 4 and they have the built-in Apple Foundation model, Qwen3.5, and other models:
Locally AI - https://locallyai.app/
I assume it is the 26B A4B one, if it runs locally?
A second idea is input audio in other languages, like Czech, Polish, or French.
Google really ought to shut down their phone chip team. Literally every chip from them has been a disappointment. As much as I hate to say it, sticking with Qualcomm would have been the right choice.
Actually I found official performance numbers from Google saying iPhone gets 56 tok/s and Qualcomm gets 52. They don't even bother listing Tensor in their table. Maybe because it would be too embarrassing. Ouch! https://ai.google.dev/edge/litert-lm/overview
brew install ollama
ollama run gemma4:26b-a4b-it-q4_K_M
All it needs is web search so that it can get up to date information.
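Once that's running, Ollama also serves a local REST API on port 11434, so you can script it; a minimal sketch (web search itself would still have to be bolted on separately):

    import requests  # Ollama's local REST API, per its docs

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma4:26b-a4b-it-q4_K_M",  # tag from the command above
            "prompt": "Summarize today's top HN story.",  # illustrative
            "stream": False,
        },
    )
    print(resp.json()["response"])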
remember, megacorps are dying for infinite amounts of analytics data
https://github.com/a-ghorbani/pocketpal-ai
https://apps.apple.com/us/app/pocketpal-ai/id6502579498
https://play.google.com/store/apps/details?id=com.pocketpala...
After some back and forth the chat app started to crash tho, so YMMV.