Embedded within that developer page is a good explainer of the encoder-free architecture: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

by amelius 11 hours ago

I skimmed it, but I still wonder (1) why we still need a tokenizer for text, and (2) why the other modalities (audio/video) don't need one.

by sigmoid10 6 hours ago

How do you think the other modalities are fed into the attention layers? The other modalities are tokenized as well; that's literally what these separate image/audio encoders produced as output before feeding it into the main network. Tokenization is at its core just a tradeoff between sequence length and embedding size, so it will probably stay relevant as long as attention layers scale quadratically with sequence length.
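
To make that tradeoff concrete, here is a rough sketch in Python. The document length and the "about 4 characters per subword token" figure are illustrative assumptions, not measurements from any particular tokenizer; the only point is that the number of query-key pairs in full self-attention grows with the square of the sequence length.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of query-key pairs in one full self-attention head."""
    return seq_len * seq_len

# Hypothetical document of 4,000 characters, tokenized two ways.
# Assumption: roughly 4 characters per subword token.
char_level_len = 4_000
subword_len = char_level_len // 4

ratio = attention_score_entries(char_level_len) / attention_score_entries(subword_len)
print(f"character-level: {attention_score_entries(char_level_len):,} pairs")
print(f"subword-level:   {attention_score_entries(subword_len):,} pairs")
print(f"ratio: {ratio:.0f}x")
```

Shortening the sequence by 4x cuts the pairwise work by 16x, which is the pressure that keeps some form of tokenization around.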

by asim 1 day ago

That's a great explainer, thanks for sharing it.

by dofm 1 day ago

I would contend that the actual big story is the gallery app:

https://developers.google.com/edge/gallery

Anyone with a 16GB Mac — that is quite a lot of journalists, surely — can download that, install a model into it, and play.

Surely journalists have to start asking questions at least about OpenAI's consumer revenue projections now.

I am a major, major AI cynic, but I decided to be an informed cynic so I've been playing with local models for agentic work and a bit of CAD-to-image generation. I really quite like the 26B Gemma model — I've been using it to teach myself some fundamental things and learn OpenCode without developing a cloud dependency. It writes fairly good code and it is helping me learn the things I want to learn at a pace that I prefer.

But if this 12B model is even half as close as they say it is, this casts some doubt on the consumer end of the cloud business model, at least in the short term.

(Not clear if this app is using the MTP drafters; I've still not got them working with Gemma myself, though the Qwen 3.6 built-in MTP support is super in LM Studio)

by minimaxir 1 day ago

I had discounted Edge Gallery because it didn't support system prompts, but now it does so I will give it another go. I believe the implementation does use MTP since I got an update to Gemma-4-E4B on iOS indicating such, and on macOS it's very speedy.

However, on my 18GB RAM MacBook Pro, selecting Gemma-4-12B-it results in this error:

> The model "Gemma-4-12B-it' requires more memory (RAM) than is available on your device.

So yeah, my questions about the 16GB marketing copy are fair.

by dofm 1 day ago

Interesting; they may have fluffed up somewhere then.

(Though perhaps it'll squeeze in with a small context window? Not sure I understand that aspect yet)

It does seem to use MTP, yes, and it is quite quick — seemingly the underlying LiteRT stuff can do MTP with Gemma 4 and presumably MTP is a big part of the practicality picture here.

The system prompt thing was a surprise when I poked around.

by sureglymop 19 hours ago

Is the story that it's now also available outside of android? I've had this app on my phone for I believe about a year.

by dofm 8 hours ago

It has certainly not been well-publicised that it is available on Mac and iOS but you are right, likely I just missed this news.

The combination of these things, though, I still think is significant. It’s a product from an old-fashioned (!) FAANG that installs as easily as Chrome, downloads a model as easily as it could be, combines a chat interface with audio and video analysis/transcription, has a customisable system prompt, MTP, agent skills support etc.

Now, it is from Google so they could kill it when they get bored! But clearly this is local AI packaged in a really accessible format, and the model seems quite capable for its size. It is something Microsoft could do when they can really point to easy consumer hardware that can do it well. It’s certainly something Apple could do better with their distillations of Gemini under the Google deal.

I think a sane line of enquiry for a tech journalist is: 1) doesn’t this threaten the appeal of consumer-tier subscriptions to ChatGPT (which is a big part of OpenAI’s revenue plans), and 2) if so, aren’t the buy-and-hold economics of DRAM, SSD and GPU products, whose ridiculous price increases OpenAI provoked and benefits from, fundamentally anti-consumer?

by spott 1 day ago

This is just early fusion basically.

FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818

I've been waiting for something like this to be released since then.

The annoying thing is that chameleon was multi-modal out based on the same principles, but this model is just inputs... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output).

by santiagobasulto 1 day ago

I don't think it's the same. It's a similar concept, but Gemma is using just a linear projection, which I assume is a lot faster. The developer guide has more details: https://developers.googleblog.com/gemma-4-12b-the-developer-...

    Vision embedder (35M parameters): Replaces the 27 vision transformer layers of the other medium-sized Gemma 4 models. Raw 48x48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input

The "single matmul" is the key here. I haven't tried it, but it's probably pretty fast and memory efficient.
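
A toy sketch of what the quoted description could look like, in plain Python. Only the 48x48 patch size and the "one matmul plus X and Y lookup tables" shape come from the quote. The RGB channel count, the tiny hidden size, the 2x2 grid, and the choice to add the position vectors (the guide only says they are "attached") are assumptions for illustration, not the model's actual code.

```python
import random

PATCH = 48       # patch side length, from the quoted guide
CHANNELS = 3     # assumption: RGB input
HIDDEN = 16      # toy hidden size; the real model's is far larger
GRID = 2         # toy image is GRID x GRID patches

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]

W = rand_matrix(PATCH * PATCH * CHANNELS, HIDDEN)  # the "single matmul"
x_table = rand_matrix(GRID, HIDDEN)                # one row per patch column
y_table = rand_matrix(GRID, HIDDEN)                # one row per patch row

def embed_patch(pixels, col, row):
    """Project one flattened patch, then add its factorized position."""
    out = [0.0] * HIDDEN
    for value, weights in zip(pixels, W):
        for j in range(HIDDEN):
            out[j] += value * weights[j]
    # Assumption: position is attached by addition.
    return [o + x_table[col][j] + y_table[row][j] for j, o in enumerate(out)]

def embed_image(patches):
    """patches[row][col] is a flat list of PATCH * PATCH * CHANNELS floats."""
    return [embed_patch(patches[r][c], c, r)
            for r in range(GRID) for c in range(GRID)]

image = [[[rng.random() for _ in range(PATCH * PATCH * CHANNELS)]
          for _ in range(GRID)] for _ in range(GRID)]
tokens = embed_image(image)
print(len(tokens), len(tokens[0]))  # 4 16
```

The factorized lookup needs one row per column index plus one row per row index, rather than one row per (x, y) pair, so the position tables stay small even for large grids.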

by ahmadyan 22 hours ago

Some of the FAIR people moved to Thinky, and they also started doing encoder-free MM-LLMs. Now Google. This seems to be becoming a trend that works at small scale, but the difficult part is scaling.

The standard approach for training MM-LLMs is to train the encoder first. There are O(2-10B) good images on the internet, and the encoder needs to see each image O(10-100) times; that is O(100T) tokens, which is more than the entire pre-training budget for most runs. That is the reason we train the encoder separately (smaller model, 2B active vs 30B or 200B active LLM); there is nothing magical about training the encoder and LLM together, it is just more token-efficient to train the image modality first.
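
The arithmetic behind that estimate can be sketched as follows. The image counts and passes per image are the ranges given in the comment; the tokens-per-image figure is an assumption added here, since the comment does not state one.

```python
def encoder_tokens(num_images: int, passes_per_image: int, tokens_per_image: int) -> int:
    """Total image tokens processed while training on the whole image set."""
    return num_images * passes_per_image * tokens_per_image

TOKENS_PER_IMAGE = 256  # assumption: not stated in the comment

low = encoder_tokens(2_000_000_000, 10, TOKENS_PER_IMAGE)
high = encoder_tokens(10_000_000_000, 100, TOKENS_PER_IMAGE)

print(f"low end:  {low / 1e12:.1f}T tokens")
print(f"high end: {high / 1e12:.1f}T tokens")
```

Under that assumption the range runs from a few trillion to a few hundred trillion tokens, which brackets the O(100T) figure.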

by jszymborski 1 day ago

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.

by minimaxir 1 day ago

In hindsight I may have been pedantic.

by wilkystyle 1 day ago

I had a similar thought to you, and found your question and the resulting discussion helpful!

by santiagobasulto 1 day ago

Not at all, I had the same feeling as yours the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory efficient. A single matmul vs a ViT encoder is probably a huge win.

by alberto467 1 day ago

Not at all. Getting really pedantic, tokenization is also a form of encoding, so it doesn't matter the modality you're using, you'll end up doing some type of encoding in some way.

by altruios 1 day ago

Tokens are such a strange base unit. Couldn't we do something that naturally conforms better to reality than such choppy units that cause all sorts of artifacts? Making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but it's far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal based transformers are the next jump.

After that a s1/s2 system: fast generation, slow wave correction / observation operating over the fast generation seems like the next leap forward.

Tokens create and hide too many problems to be the 'optimal' solution.

by selectodude 23 hours ago

Not to be too snarky but there’s a few trillion dollars and some of the brightest minds of our generation working on this. I’m sure there’s a reason why they’ve settled for or are stuck on tokenization.

by andai 22 hours ago

Yeah, I'm sure we ended up with JavaScript for great reasons too.

by TeMPOraL 21 hours ago

> making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language

Your problem isn't with tokens, but with "language". Tokens have little to do with language, other than usually being consumed in sequence, but that's true of anything that has to span over time. Thinking of tokens as letters or subwords is mistaking the general with the specific. We may have started with letters and words and subwords (trying to find the best balance for training), but then people figured why not add pixel patches to the dictionary, and then sounds, and then other signals, and after iterating on it a bit, we now have image and sound and symbol sequence data all being part of the same token space.

LLMs stopped being about "language" - in the sense of English, or C++ - long, long time ago. We're still using tokens, but they're more like quanta of sensory input now.

You can take it in two directions, I guess - either consider "Large Language Model" to be an anachronym, a name that couldn't keep up with times, but we got used to it back when it made sense, or alternatively, just broaden your understanding of "language" to encompass any pattern of quantized sensory inputs, regardless of modality :).

(Given how we know humans can communicate with pictures, gestures, body language, noises, movement, actions, or even gaze, and that when it becomes common enough, such systems develop their own pattern structure - dare I say vocabulary and grammar - and that none of it requires or usually involves going through a "normal language" intermediary - I'd lean towards the second direction :)).

ETA: also wrt. "thinking with tokens", LLMs don't really think in tokens. You may have heard that phrase, that may have been coined by Karpathy, that "for LLMs, tokens are units of thinking". It's a useful shorthand to remind people that prompting models to be terse and skip prose is effectively dumbing them down, but it's also a bit misleading.

A better analogy is that tokens act like clock signals: each consumed token causes a certain amount of computation to happen in the network, much like a single clock signal in digital electronics, or turning a crank one revolution in a mechanical contraption. This makes tokens "units of thinking" in the sense that processing N tokens causes M amount of computation to happen. Now, for whatever problem you're solving, there is a minimum amount X of computation that is required to solve it correctly, and it's mathematically impossible to do with less. So if you ask an LLM to solve it, it needs to process at least as many tokens as it takes for M = X. If you force the model to be so terse that it makes M < X, you literally make it impossible to succeed. In practice, you need M >> X.
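
As a back-of-the-envelope version of this argument, the sketch below uses the common rule of thumb that a dense model spends roughly 2 FLOPs per parameter per token in a forward pass. The required compute `X` is a made-up number for illustration, and real problems do not come with a known FLOP budget, so treat this as a picture of the reasoning rather than a way to size prompts.

```python
import math

def min_tokens(required_flops: float, params: int) -> int:
    """Fewest tokens whose forward passes add up to the required compute."""
    flops_per_token = 2 * params  # rule of thumb for a dense forward pass
    return math.ceil(required_flops / flops_per_token)

PARAMS = 12_000_000_000  # a 12B-parameter model
X = 1e15                 # hypothetical compute needed for some problem

print(min_tokens(X, PARAMS))
```

Any answer shorter than that token count cannot have done enough computation under this model, which is the M < X case above.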

by altruios 21 hours ago

Can you elaborate more on what a token looks like as a pixel patch/sound/general signal as it currently is (in this model)?

My understanding of pixel representation is: slice the image into a grid; each square slice gets projected into a number array x long (not sure how long x is, or if it's variable), which then gets projected down to a token representing that space (3-4 long as alpha-numeric) and AGAIN gets passed into a "position detector" which outputs a token representing that pixel/position, which gets passed into the LLM (as a significantly reduced/translated signal in token space).

First, before continuing: do I have that mostly correct?

by yorwba 10 hours ago

> number array of x long (not sure how long x is, or if it's variable), which then gets projected down to a token representing that space (3-4 long as alpha-numeric)

There is no such projection step. The array of x numbers is the token. For text, there is a one-to-one correspondence between the textual representation of a token, its index in the vocabulary of the model, and the array of x numbers that is fed into the linear algebra of the model, so people often equivocate between them; but for images or sound, there is no discrete vocabulary and no textual representation, only the array of x numbers.
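
A minimal sketch of that distinction, with a made-up three-word vocabulary and made-up weights (none of these values come from the model):

```python
HIDDEN = 4

# Text: a discrete vocabulary, so string, index, and vector all line up.
vocab = ["the", "cat", "sat"]
embedding_table = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

def embed_text(piece: str) -> list:
    """Look up the row for a known vocabulary entry."""
    return embedding_table[vocab.index(piece)]

# Image or audio: no vocabulary and no index. The vector is computed
# directly from raw values with a learned projection matrix.
projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]  # maps 3 raw values to HIDDEN numbers

def embed_raw(values: list) -> list:
    """Multiply a row vector of raw values by the projection matrix."""
    return [sum(v * row[j] for v, row in zip(values, projection))
            for j in range(HIDDEN)]

print(embed_text("cat"))           # [0.5, 0.6, 0.7, 0.8]
print(embed_raw([1.0, 2.0, 4.0]))  # [3.0, 4.0, 0.0, 0.0]
```

Both functions return an array of `HIDDEN` numbers, and that array is what the rest of the network sees. Only the text path has a name or an index attached to it.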

by refulgentis 23 hours ago

This sounds like when crystal people talk quantum physics.

by CamperBob2 22 hours ago

I agree with the GP. The idea that there's not a better intermediate representation between tokens and embedding vectors seems absurd. But how to arrive at such a representation and implement it effectively is a few zeroes above my pay grade.

by refulgentis 22 hours ago

I find your agreement seductive because it side steps the unfounded assertions and simply asserts there must be something different and we don’t know it, which is easy for me to agree with too. Or maybe hard to disagree with.

by cortesoft 21 hours ago

Being pedantic isn't a bad thing in technical discussions.

by kristjansson 1 day ago

> quantization

12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?
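
The same arithmetic as a small Python sketch. It counts weights only; KV cache, activations, and any per-block scale factors a quantization format stores are left out, so real memory use will be higher.

```python
def weight_gb(params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9

PARAMS = 12e9  # 12B parameters

for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_gb(PARAMS, bits):.0f} GB")
```

That gives 24 GB, 12 GB, and 6 GB respectively, so a 16GB machine can hold the weights at 8 bits or below but not at bf16.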

But TBD how well the base model performs before thinking too much about quantization

by magicalhippo 22 hours ago

Smaller models are less forgiving of quantization. For a 12B model I wouldn't expect Q4 to be "pretty close", unless it underwent quantization aware training (QAT). Of course it's not set in stone, there's a huge variance between models, so this might surprise.

by mchinen 1 day ago

The audio side is even more interesting, as it seems they totally got rid of positional embeddings and are just doing a single linear transform to match the LLM input dimension and that's it.

> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

by make3 1 day ago

I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally, not worth mentioning.

by neosat 1 day ago

Agree. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.

by aesthesia 21 hours ago

Audio is 1 dimensional so the usual RoPE position encoding should handle it like it does for text tokens. You only need extra position encoding for higher-dimensional stuff like images.
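
For readers who haven't seen it, here is a minimal RoPE sketch in plain Python. It rotates adjacent pairs of a vector by angles proportional to the sequence position; implementations differ in how they pair dimensions, and nothing in this thread confirms which variant, if any, this model applies to audio. The vectors and positions are arbitrary.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = position * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same offset (3 steps) at different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)  # True
```

The attention score depends only on the distance between the two positions, which is all a 1-D sequence such as text or audio frames needs. An image patch has two coordinates, hence the extra X/Y lookup on the vision side.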

by mchinen 1 day ago

Ah yeah, thinking further it's probably just using some positional embedding based on sequence numbering added in the LLM layers. For vision it needs the patch location as well.

by pseudollm 20 hours ago

No there isn't - read the paper. It's just 40 msec of raw audio samples, multiplied by one matrix to translate them to a 3800-wide input vector. That's it. The next 40 msec are fed in at the next transformer input step, without any positional encoding. Repeat ad infinitum.
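
A toy sketch of that pipeline in plain Python. The 40 ms frame length is from the comment; the 16 kHz sample rate, the non-overlapping framing, and the tiny hidden size are assumptions made here for illustration.

```python
import random

SAMPLE_RATE = 16_000  # assumption: not stated in the thread
FRAME_MS = 40         # from the comment
HIDDEN = 8            # toy size; the comment mentions roughly 3800

FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # samples per frame

def frames(samples):
    """Split a waveform into non-overlapping frames, dropping any remainder."""
    count = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(count)]

def project(frame, weights):
    """One matrix multiply: FRAME_LEN raw samples to HIDDEN numbers."""
    return [sum(s * w[j] for s, w in zip(frame, weights)) for j in range(HIDDEN)]

rng = random.Random(0)
weights = [[rng.uniform(-0.01, 0.01) for _ in range(HIDDEN)]
           for _ in range(FRAME_LEN)]
one_second = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]

tokens = [project(frame, weights) for frame in frames(one_second)]
print(len(tokens))  # 25
```

One second of audio becomes 25 input vectors regardless of the assumed sample rate, since 1000 ms / 40 ms = 25.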

by matja 1 day ago

One side effect is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed when using the model with llama.cpp etc.

by lambda 1 day ago

It's not? There's an mmproj in the GGUFs released by ggml-org: https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/tree/mai...

From the visual guide, there's still the 35M parameter embedder, then the linear projector, for vision, and the linear projector for audio, so it does have some parameters used for the multimodal input to project it into the LLM latent space: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

And the Unsloth quants, which are missing this, don't support multimodal input. (edit: actually, I may have just needed to update my llama.cpp, will check with an updated llama.cpp soon)

I'm downloading the ggml-org GGUFs now, I tried Unsloth but got some weird problems, double checking with the bf16 model to see if the issue was just the quant.

by lambda 18 hours ago

Ah, Unsloth has uploaded mmproj now as well.

by pferdone 1 day ago

But do I have the option to run it 'text only'?

by mips_avatar 1 day ago

I don't think we've bottomed out on what we can do with embedding models. They're these tiny models that absolutely rip on modern cpus with 8 bit int optimizations. Like in my app we can say pretty definitive things about hundreds of millions of places in the world on retrieval tasks on regular hardware.

by teravor 22 hours ago

I don't see how encoder-free audio isn't a mistake here. A mimo model will at least get the audio to 12.5 Hz as opposed to the 25 Hz they are doing, and you don't need to finetune mimo either.

by woadwarrior01 1 day ago

There are many priors to encoder-free VLMs. I specifically remember the EVE series of models from ~2 years ago.

https://github.com/baaivision/EVE

by wolttam 1 day ago

I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.

by rao-v 1 day ago

Encoder-free is huge for running on SBCs etc. Often the encoding time is a significant fraction of generation time if you are using a VLM as an all-purpose vision model.

by reactordev 1 day ago

It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in their own ways so YMMV but overall, it’s about as solid as Qwen just with a more advanced architecture.

by goobatrooba 1 day ago

Either Google changed the text or you editorialised it a tiny bit - just for all others that got excited, they mean 16GB VRAM. So a premium graphics card requiring a >2500€ device is the minimum to run this.

Still progress, but not quite democratic yet.

Weird though that Google might be cannibalising its own AI subscription service?

by LoveMortuus 22 hours ago

I've bought a laptop for <1500€ that came with 32GB of RAM and an RTX 3080 with 16GB of VRAM. So I don't think a >2500€ device is necessary, though I'm certain it would yield better and faster results.

by spider-mario 2 hours ago

Or a MacBook Air with unified memory?

by thot_experiment 1 day ago

I haven't tried this model yet, but I can run Gemma 31B w/ the MTP drafter in pure CPU at about 10tok/s so this should run at about 20-30tok/s on a decent CPU, it'll probably run at >50tok/s on any Mac that can fit it, and lots of people have a gaming GPU with enough VRAM. In terms of access to hardware being a gate, it's one you can hop pretty easily.

by dofm 1 day ago

Could you outline how you are running the MTP drafters? I've tried LM Studio but no dice there. I'm probably missing something but I think llama.cpp and Ollama can't do it yet either?

by thot_experiment 22 hours ago

I just build llama.cpp from scratch on the PR that has MTP drafters.

https://github.com/ggml-org/llama.cpp/pull/23398

Please don't use Ollama, it's a bad actor in the OSS community.

by dofm 22 hours ago

I don't have the energy to build stuff all the time, that's a rabbit-hole side tunnel I don't really want to get into. I have larger concerns in my life that are more urgent than developing that side of things.

But I've moved on from Ollama for the time being, though I am mainly interested to see what the Gemma 4 MTP speeds are like on my M1 Max, so I may test it.

I am quite impressed with the tools in LM Studio, which is also a beautiful app, but it is not open source (which challenges my personal strategy somewhat) and I dread its inevitable enshittification.

Nevertheless the GUI has been very helpful while I learn, and I will probably use it until something else presents or my usage pattern settles down from experimentation to something a bit more routine.

I will try oMLX, too, but judging by the LiteRT page I may soon be able to just use that for the larger models if I end up settling with Gemma 4.

by thot_experiment 22 hours ago

Totally understandable. YMMV but I found the llama.cpp build process to work on the first try on my machine, and it only takes a couple minutes, which definitely isn't my usual expectation or experience. I was very pleasantly surprised. Their web-ui is also getting very polished while still doing a great job of letting you tweak all the weird settings.

by dofm 21 hours ago

Sorry, I sounded a bit terse there!

You have probably convinced me to give it a try, to be honest.

It's just that, to cut a long story short, I am currently recovering from a level of burnout so severe that twelve months ago had me fully convinced I was actually in early-onset cognitive decline (I am a bit over fifty).

Only a little over two months ago I was still sure I'd have to quit IT and find a slow job because I was so out of the loop; this whole industry shift even in just the last few months is so shocking and strange.

So I have to be a bit cautious about how many indirections I add, if that makes sense. But I am compiling bigger projects than llama.cpp so I will give it a go.

Thank you for the extra detail.

by Patrick_Devine 23 hours ago

I haven't yet pushed the MTP enabled gemma4 12b model for Ollama because in my testing I wasn't getting a performance bump. The other gemma4 MTP models should work OK right now, but there are some fixes we're just about to push. This is specifically for the MLX backend.

by dofm 23 hours ago

Thanks for your reply. I will go back and look at Ollama again.

So much to learn but this news has really vindicated my decision to direct my limited span of concentration and focus to learning how to use open weights models and opencode.

by ch_sm 23 hours ago

Can't speak to compatibility with this new model, but oMLX supports MTP drafters very well.

by dofm 23 hours ago

Thank you, I will test that.

by ActorNightly 22 hours ago

Google is an advertising company first and foremost. At some point, these local models have to fit into that umbrella. I don't quite know how yet, but it's going to happen.

That being said, the real value in paid plans is that you get ecosystem integration that can read your gmail, photos, docs, and so on.

by bitexploder 21 hours ago

Google is also a Cloud Provider. Cloud is now ~18% of Google. While it is an advertising juggernaut, Cloud is also rapidly growing, so the local models simply fit as AI research and dev and getting more people on Gemini models. They /are/ advertising, effectively :)

by hattimaTim 17 hours ago

I wish they were :) But the gemini models are so unstable in API that I can not even use them for production.

by jpadkins 21 hours ago

Local models still need information retrieval.

by GaggiX 1 day ago

> That's technically encoding

Isn't that just projecting the patches into the d_model size vectors that the model takes?

>I am assuming that involves of quantization

12B model in 16GB seems very reasonable to me, int8 is top quality for running models.

by minimaxir 1 day ago

The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."

12B at int8 would take up 12G memory, or 75% of the system memory which technically fits within 16GB but the OS will not like that. EDIT: On my 18G memory MacBook Pro, LM Studio reports a "partial GPU offload" for the int8 MLX weights. Can't test because the `gemma_unified` architecture is NYI.

by WhitneyLand 1 day ago

Yeah and it’s pretty memory efficient with only 8 attention layers so at int8 in 16GB ram maybe you still get 64k-128k context.

The part I hate though is that I’d bet none of the performance claims are based on int8.

Why do we care about bf16 benchmarks when no one will be using that with this model.

by WhitneyLand 1 day ago

I don’t think so, the HF weights are bf16 which means 24GB + cache/overhead.

It sounds like marketing spin where the performance claims are based on BF16 and the “runs in 16GB” claim is on a totally different quantized version.

by madduci 1 day ago

VRAM, not RAM. I wish it was light enough for iGPUs too

by KiwiJohnno 15 hours ago

I ran the 26B model on my i5 which has no discrete graphics card. It ran about 7 tokens/sec, and appeared to be a very capable model.