Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

[-]

Not to mention the text-only 0.8GB version. Just crazy. You can now have basic real-time conversations on-device that are video- and audio-aware.

by yalok14 hours ago|

[-]

0.8GB is for text only. It's more like ~1.1GB if you include video/audio encoder

by reactordev1 hours ago|

[-]

And your point is what? That it's more than the 0.8GB text-only figure if you include more than text only?

by simonw20 hours ago|

[-]

Have you seen a 0.8GB model file floating around yet? I couldn't find one earlier.

by reactordev18 hours ago|

https://huggingface.co/google/gemma-4-E2B-it-qat-mobile-ct

[-]

I think this is the one but it’s 0.8GB VRAM not 0.8GB size.

But they could be cooking up a smaller one because the model card lists the Q_4 quants as being bigger than the mobile or text-only so I think we’ll need to wait for the Q_2_Distilled_Mobile_Textformer version. Still, just amazing work.

by viccis11 hours ago|

[-]

I'll be honest with you. My main ask for on device AI is that when I am typing "Going out for a quick j" it corrects to "jog" and not "Jonathan". I don't think it needs that many gigabytes.

by taffydavid10 hours ago|

[-]

Who doesn't enjoy a quick Jonathan now and then.

But seriously, wouldn't predictive text on a 90s cell phone pass this test?

by reactordev6 hours ago|

[-]

The autocomplete of a decade ago is better than what we have now.

It’s harder now because of emojis, draw-to-type, and pen input. We didn’t have these things 14 years ago, when “I’ll be right back” could be expanded from “I’ll b ri ba”.

by madduci12 hours ago|

[-]

Where is it? On ollama I see only the bigger one

by reactordev1 hours ago|

[-]

I don’t use ollama, can you pull from HF?

by rcarmo18 hours ago|

[-]

Is that actually QAT? the MLX Community models have that in their names, but these don't, and the upload dates don't quite line up.

by __mharrison__18 hours ago|

[-]

As an aside uvx is so pleasant to use... I wish Nvidia supported it as first-class rather than making folks jump through Docker hoops.

by NamlchakKhandro15 hours ago|

[-]

I wish people would stop using Python for AI.

It's slow and the PKG resolution is way too flat.

by qwertox10 hours ago|

[-]

What do you use?

by satvikpendem1 days ago|

[0] https://huggingface.co/collections/unsloth/gemma-4-qat

[-]

Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% accuracy compared to the BF16 model that is unquantized, and Unsloth's quants are better than the original Google's QAT as posted in the article.

Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API. It works very well for that, even with the model embedded on phones.

[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

by llmoorator1 days ago|

[-]

you misunderstand what that chart shows - it shows BF16 QAT Q4_0, not BF16 regular.

meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

Like storing small 8 bit numbers in full 32 bit integers.

So it's not close to 100% of unquantized BF16.

I'm curious if anybody can explain why Google's released 4-bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0. It seems like it should be just bit twiddling, with no further quantization needed to convert between these two packings. Unsloth talks about "lattice alignment" being an issue.

That being said, I hate it that smol model makers, like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4 bit vs 6 bit...

by coder54322 hours ago|

https://developers.googleblog.com/en/gemma-3-quantized-aware...

[-]

> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.

The Gemma 3 QAT report was a bit clearer:

"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."

The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.
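A rough sketch of the "simulated quantization" idea, in plain Python. It assumes a simplified symmetric 4-bit-style scheme (blocks of 32 weights, one scale per block, integer levels from -7 to 7), which is not the exact llama.cpp Q4_0 layout, and `fake_quantize` is a hypothetical helper name. During QAT the forward pass would see the rounded values while the higher-precision master weights keep being updated, which is why the BF16 checkpoint is not itself a set of 4-bit values.

```python
def fake_quantize(weights, block_size=32, levels=7):
    """Quantize then dequantize: snap each block to a coarse grid, return floats."""
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        if absmax == 0.0:
            out.extend(0.0 for _ in block)
            continue
        scale = absmax / levels  # one scale shared by the whole block
        for w in block:
            q = max(-levels, min(levels, round(w / scale)))  # small integer code
            out.append(q * scale)  # back to float, now on the grid
    return out


# Hypothetical "master" weights; the values are made up for illustration.
master = [0.013 * ((i * 37) % 29 - 14) for i in range(64)]
simulated = fake_quantize(master)
max_err = max(abs(a - b) for a, b in zip(master, simulated))
distinct_in_first_block = len({round(v, 9) for v in simulated[:32]})
```

The master weights and the simulated weights differ by at most half a grid step per value. That gap is the error QAT trains the model to tolerate, and it only shrinks to zero if the master weights already sit exactly on the grid.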

by 3abiton18 hours ago|

[-]

Is there evidence that this approach helps maintain "accuracy" when quantized? It sounds a bit like mxfp4 with gpt-oss, which was a confusing model upon release.

by dofm1 hours ago|

[-]

I have just been humbled by the Gemma 4 26B QAT build (unsloth's version), which insisted repeatedly that I am wrong in my requirements for some niche wordpress code, which cannot be satisfied.

I am a good WP developer so I kept prodding it and it kept insisting, and it explained with clarity. Turns out it is right and I was wrong, as I would have found out if I'd written the code myself.

I've been using this particular test for days, experimenting in ways to generate and prompt code. The 4-bit quantisation of the pre-QAT model does not catch this error. And nor can the Qwen 3.6 sparse model, which confidently blazed past it and never mentioned it.

(FWIW neither did plain ChatGPT; maybe Codex would)

Anecdotal, but there you go. I am somewhat weirded out by it.

by ComputerGuru18 hours ago|

[-]

So what we want now is unsloth (or anyone) to release 4/6-bit quantized models of these releases?

by coder54318 hours ago|

[-]

Yep, Unsloth already did, as linked in the comment at the top of this thread

by satvikpendem22 hours ago|

[-]

Ah I see, thanks for the clarification.

by mft_10 hours ago|

[0] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

[-]

Is this [0] saying that unsloth's versions of Google's QAT models are better than Google's own QAT models? Or am I not understanding it correctly?

by ComputerGuru2 hours ago|

[-]

It's saying it's better than naively truncating the QAT release to 4 bits.

by scosman16 hours ago|

[-]

Google's QAT claims to need 6.7 GB RAM, vs Unsloth's dynamic quants at 8GB. Would love to see some benchmarks. Both amazing for size.

by slopinthebag1 days ago|

[-]

I'm confused, the unsloth model is ~600mb and the one from google is 7gb?

by overfeed22 hours ago|

[-]

One is quantized, the other one is Quantization-ready.

by jhatax20 hours ago|

[-]

It’s the Friday before WWDC during which Apple is going to announce an “improved” Siri based on Google models (a locked partnership, for now). Maybe it’s a coincidence, but this might be Google releasing models that will be showcased next week by Apple?

No knowledge, just speculation.

by illusive408014 hours ago|

[-]

As an amateur app dev using on device AI: If they replace Apple Foundation model with Gemma 4 I would be so happy.

by itake2 hours ago|

[-]

I’m curious: what performance improvement would you expect, and why?

by trollbridge15 hours ago|

[-]

Maybe Siri will become capable of doing what I can do on my Mac with llamafile and a few minutes of work...

by jbarrow21 hours ago|

[-]

Very impressed with how much the Gemma ecosystem has advanced just this week.

Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!

by minimaxir1 days ago|

[-]

It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.

It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B being 6.7GB, which will indeed fit Google's claims of fitting within 16GB comfortably, although it confirms that only the quantized version will do so.

Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
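A back-of-the-envelope check of the 6.7GB figure. It assumes the standard llama.cpp Q4_0 layout (blocks of 32 weights stored as 16 bytes of packed 4-bit values plus one 2-byte FP16 scale, so 4.5 bits per weight) and a round 12 billion parameters, which is an assumption about what "12B" means here. Real files mix tensor types, and the runtime adds KV cache and activations on top, so treat this as an estimate only.

```python
def q4_0_bits_per_weight(block_size=32, scale_bytes=2):
    """Bits per weight for a block of 4-bit values plus one per-block scale."""
    packed_bytes = block_size // 2  # two 4-bit values per byte
    return (packed_bytes + scale_bytes) * 8 / block_size


def weight_gigabytes(num_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


bpw = q4_0_bits_per_weight()
q4_gb = weight_gigabytes(12e9, bpw)   # assumed 12e9 parameters
bf16_gb = weight_gigabytes(12e9, 16)
print(f"{bpw} bits/weight, Q4_0 ~{q4_gb:.2f} GB, BF16 ~{bf16_gb:.1f} GB")
# prints: 4.5 bits/weight, Q4_0 ~6.75 GB, BF16 ~24.0 GB
```

Under those assumptions the quantized weights land close to the quoted 6.7GB, while the same parameter count in BF16 would not fit in 16GB at all.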

by Aurornis1 days ago|

[-]

I'm not sure why you think it's awkward to have multiple releases. It's better to release models and variations as they're ready, not withhold them all until everything is ready to release all at once.

The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.

by netdur1 days ago|

[-]

not sure if I understand you, but 4Q and QAT 4Q are different

by refulgentis1 days ago|

[-]

It's super annoying when you have products that utilize these because there's...4? releases in 3 weeks?

- Gemma 4 2B/4B/27BE3B/31B

- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)

- Gemma 4 12B (2 days ago? 1?)

- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)

It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.

Extremely glad for the output, not glad to have to chase it.

ex. llama.cpp currently supports the originals but not the MTP predictors but there is a patch for the MTP predictors but not for the small MoE models and I think it supports the 12B but maybe not media for it yet and now we have these too and the blog says there's GGUFs (llama.cpp models) but there isn't in any of the 12? repos I clicked through. and ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.

Also if anyone at Google is taking feedback over to b/ or product, pleaseeee stop the "E"2B "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or a 4B (i.e. will run on most consumer hardware since 2021)

EDIT: And, yes, the QAT 12B x mmproj does not work with llama.cpp. I'm glad there's people who have the luxury of not having to, well, actually use these and treat me as whining :) I'll need to schedule another 4-8 hours of work for the 4th time, no fun!

by ddarolfi1 days ago|

[-]

These models aren't products? They are open source ish (open weight I guess), research outputs. While the naming scheme may be confusing, it is relevant and important. I believe it's on you to understand it.

by sumedh17 hours ago|

[-]

> I believe it's on you to understand it.

This is exactly why Google has 10 messenger Apps.

by nolist_policy10 hours ago|

[-]

Google released their latest messenger app 9 years ago. https://en.wikipedia.org/wiki/Google_Chat

by refulgentis23 hours ago|

[-]

I understand it. :)

And you're absolutely right to point out they aren't products - I hoped that was clear - when you're building a product with them, you end up having to do the same build loop 4 times, in this instance :)

by overfeed22 hours ago|

[-]

You can stop after the first one. Choosing to repeat the process is on you, and probably because you see some benefit in using the variant(s) you build on top of.

by ddarolfi22 hours ago|

[-]

Yes my framing was a little confusing. You were clear in that you are building products on them. I was more saying that because these gemma models are not products, and instead research outputs, the naming scheme should be more scientific rather than consumer friendly.

by satvikpendem1 days ago|

[-]

Just use Unsloth Studio it supports them all.

by taffydavid10 hours ago|

[-]

Noob q: can advancements like this targeted at local inference have bonus effects for cloud inference? Presumably if you can get great results on cheaper hardware that also equates to less resource usage on cutting edge hardware, and less power draw?

Will advancements like this ultimately reduce the carbon footprint of AI?

by goldenarm9 hours ago|

[-]

Consumer and server hardware are quite different, especially Google's TPUs. They notably have much larger mixture-of-experts ratios and more complex caching systems. At such scale and inference budgets, they are incentivised to optimize as much as possible.

Also Google DeepMind has a six month embargo on strategic papers, so I bet the juiciest quantization tech isn't public yet.

by RandyOrion12 hours ago|

[-]

From the perspective of a local llm user, I think the qat doesn't solve the major problem of the gemma models.

The Gemma family (gen 1 to gen 4) consistently shows an extreme range of activations, i.e., up to 600000, essentially forcing people to use a bf16 kv cache and accept a short context window, e.g., 31b, iq4_xs quantization, 100k context window on 32gb memory. Or people use a q8 kv cache with a 200k context window and accept a large performance penalty.

In contrast, for the qwen 3.5 family, the largest activation is below 2000, making a q8 or even lower-precision kv cache essentially free. Together with linear attention, which doesn't require a kv cache, the full 262k context window can be easily reached.

Qat training with a w4a16 target, while improving inference performance with low-precision weights, doesn't solve the kv cache problem at all.

In the end, a qat is a qat, and there are unseen efforts behind qat checkpoints. Thank you gemma team for releasing qat checkpoints.
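To see why the activation range matters for a q8 cache, here is a small sketch. It assumes plain symmetric absmax int8 quantization with a single scale for the whole tensor; real runtimes typically use per-block scales, which confine an outlier's damage to its own block but do not remove it. The 600000 and 2000 maxima are the figures quoted above, and the helper names are hypothetical.

```python
def int8_step(absmax):
    """Size of one int8 step under symmetric absmax scaling."""
    return absmax / 127


def quantize_dequantize(x, absmax):
    """Round x to the nearest representable int8 level and convert back."""
    step = int8_step(absmax)
    q = max(-127, min(127, round(x / step)))
    return q * step


for absmax in (600_000, 2_000):
    step = int8_step(absmax)
    restored = quantize_dequantize(100.0, absmax)
    print(f"absmax={absmax}: step={step:.1f}, 100.0 -> {restored:.1f}")
```

With a maximum of 600000, one int8 step is roughly 4724, so an ordinary value such as 100 collapses to zero. With a maximum of 2000, the step is roughly 15.7 and the same value survives with a small error.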

by RandyOrion12 hours ago|

[-]

More rants about local inference, consider yourself warned.

Together with the deliberate bf16-related hardware limitations on consumer-level nvidia gpus, i.e., gtx 10, rtx 20, 30, 40, 50 series, things get sour really quickly.

by arjun-mavonic1 hours ago|

[-]

Yet to try this. But from what I heard from a friend, Gemma 4 12b calls the same tools repeatedly. Maybe the harness can be made to handle it.

by Catloafdev21 hours ago|

[-]

Being able to run the 12B on 8gb VRAM is huge. It's crazy to see how fast these small local models have evolved.

by netdur1 days ago|

[-]

had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI

The E4B model doesn’t fit on my phone TPU, so it swaps to RAM, the QAT version means more accuracy, good!

by ComputerGuru18 hours ago|

[-]

How were you getting anything useful out of that? We found the (unquantized!) E2B model to be completely useless at even the simplest real-world classification tasks.

by prism5621 hours ago|

[-]

How do you know it swaps to ram vs on the TPU?

Would be interested in testing this on my pixel.

by netdur4 hours ago|

[-]

Because TPU has 2GB and weight + context needs more

by jack_pp17 hours ago|

[-]

Ran hf.co/google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 with ollama on an AMD Ryzen 9 8940HX, NVIDIA GeForce RTX 5060 (8 GB), 14 GB RAM laptop and it is surprisingly fast

by WhiteDawn23 hours ago|

[-]

Once someone generates a MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5 year old GPU.

by pfheatwole19 hours ago|

[-]

Models:

- Safetensors: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...

- GGUF: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/...

Note the README in the Unsloth list of files: llama.cpp is working on a PR to support the gemma4 drafters: https://github.com/ggml-org/llama.cpp/pull/23398. Also note the PR submitter didn't experience much speedup with 26B (seems typical that MoE models don't generally benefit from MTP).

by dist-epoch23 hours ago|

https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...

[-]

Google already did

by dofm22 hours ago|

[-]

This is safetensors. Is there any way to run these on a Mac paired with the MLX QAT?

(Pardon my ignorance; this stuff moves so fast)

by thangalin20 hours ago|

https://point.free/blog/gemma-4-on-a-2016-xeon/

[-]

Did you see this?

Xeon, but could be useful for MTP on Mac.

by dofm19 hours ago|

[-]

I hadn't seen this, thanks.

I do have the Qwen 3.6 (35B) MTP implementation running (in LM Studio; it doesn't need a separate drafter), along with non-MTP Gemma 4 26B, and I can see that Unsloth Studio can run the new QAT, but I can't see how you can run the assistant/drafter. Yet.

It's just a constantly changing landscape. Don't get me wrong, it's fascinating and for various reasons I am pleased I can keep up even slightly, but eeeehhh :-)

by int_19h18 hours ago|

https://huggingface.co/lmstudio-community/gemma-4-26B-A4B-it...

[-]

by dofm12 hours ago|

[-]

Yeah — that is the base QAT model, and there are safetensors weights for the QAT version of the MTP drafter, but there are no MLX/GGUF versions. I think the answer is a combination of:

1) Gemma 4 MTP is too fresh for off-the-shelf software to use anyway

2) "you can convert them yourself" which is fine, obvs

by somewhatrandom923 hours ago|

[-]

Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?

by dist-epoch23 hours ago|

[-]

Google already released specialized drafters for Gemma 4.

by Havoc18 hours ago|

[-]

The E2B ones? Or what do you mean by specialized drafters?

by int_19h16 hours ago|

[-]

They have -assistant in the name, so e.g.: https://huggingface.co/google/gemma-4-31B-it-assistant

by Havoc9 hours ago|

[-]

Thanks

by girvo17 hours ago|

[-]

The “-assistant” models released by Google are specialised tiny MTP draft models :)

31b-it-assistant is what enables MTP
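For anyone unfamiliar with how a draft model speeds things up, here is a toy sketch of greedy speculative decoding. Everything in it is hypothetical: `target_next` and `draft_next` are stand-ins for the large model and the small drafter, and a real implementation verifies all drafted tokens in one batched forward pass instead of a Python loop. It is not Google's MTP implementation.

```python
def speculative_generate(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding; output matches target-only greedy decoding."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1. The cheap drafter proposes k tokens.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks the proposed tokens in order.
        for guess in proposal:
            correct = target_next(tokens)
            tokens.append(correct)  # always keep the target's own choice
            if correct != guess or len(tokens) >= goal:
                break  # a mismatch invalidates the rest of the draft
    return tokens


def target_next(tokens):
    """Toy 'large model': a fixed rule on the last token."""
    return (tokens[-1] * 3 + 1) % 10


def draft_next(tokens):
    """Toy 'drafter': agrees with the target except after a 7."""
    return 0 if tokens[-1] == 7 else target_next(tokens)


def plain_greedy(prompt, max_new):
    """Reference: decode with the target only, one token at a time."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target_next(tokens))
    return tokens


result = speculative_generate(target_next, draft_next, [2], 12)
```

The drafter never changes the output, only how many target steps can be checked together. When the drafter guesses wrong often, most of the drafted tokens are thrown away and the speedup disappears.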

by nicman2310 hours ago|

[-]

The new Gemma 4 12b model replaced qwen3.6 27b for me. The task I am doing is a bit specific: validating whether a stamp has the correct name. The ones that it could not see, maybe 30 percent, were easily discerned.

by superkuh15 hours ago|

[-]

I wish they would release the base (non instruction tuned) models for use with pattern completion.

by llminthefor9 hours ago|

https://huggingface.co/collections/google/gemma-4

[-]

they did

by cr3cr31 days ago|

[-]

For a moment I got excited thinking QAT is Intel Quick Assist Technology...

by razighter77722 hours ago|

[-]

Same, I had to do a double take. Would be pretty humorous if they somehow took advantage of crypto offloading to accelerate AI inference.

by nazgul1717 hours ago|

[-]

I don't see these QAT models on Edge Gallery; just the BF16 models are there. Is there anything I am missing?

by zkmon22 hours ago|

[-]

How can the smaller Unsloth GGUF quant beat the original Google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)

by SubiculumCode13 hours ago|

[-]

I may be wrong, but this is what I figured out. Google provided these quantize-ready models, but they do not come pre-quantized. However, to produce their benchmarks, they quantized their model using the standard quantization approach. Unsloth has an advanced quantization method that performs better than the standard quantization, so the evals are better for unsloth quants.

by Kylejeong2117 hours ago|

[-]

google pixel intelligence may beat apple intelligence

by redox9923 hours ago|

[-]

I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.

Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough.

Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.

by ilaksh21 hours ago|

[-]

I think that's probably true for the vast majority of Android phones. But if you have a SOTA expensive beast, I wonder if Gemma 4 12B at 4 bit could work? Maybe something like a Redmagic 11 pro or OnePlus 13 running NanoClaw?

But also maybe a few Qwen 3.6 or Qwen 3.5 variants can fit and can handle some simple tasks.

by redox9921 hours ago|

[-]

I think Gemma 4 12B is definitely possible to run on high end phones, google claims you need 16GB of memory. But it's probably not very usable, you'll need to swap most stuff other than the LLM.

When I tried E2B and E4B with Google Edge Gallery, and added a web search skill from the skill list, E2B would fail (get stuck in a loop), E4B would need a very specific instruction, "weather in [city name]" would not call the web search tool, I'd need "web search weather in [city name]". And the result was completely hallucinated and impossible. It claimed 14c and feels like 4c (which is impossible), and 10% humidity (which is almost impossible in this city)

Asking wikipedia level history questions (without any tool use), the results were awful as well.

[-]

I'm running a service in production using Gemma 4 models, to get structured JSON output back from web search tool calls using Unsloth Studio and its API, but it did require a rather large and detailed system prompt and tool call healing if the format wasn't JSON for example (retries, reprompting with feeding the error back into the model, etc, this is also what Unsloth Studio does for its self-healing tool call feature). But once I did that, it's been working quite well and on benchmarks I've made, it's about 97% accurate after the first time and basically 100% accurate after retries.

This is running on a server though, not sure how well it'd work on a phone, I should try that. I used AI Edge Gallery on Android and it doesn't seem too good at the web search tool but maybe the web search tool itself, being a community made tool, is pretty bad, because tool calling via Unsloth Studio seems to work just fine with the exact same Gemma models on desktop/server vs the phone.
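The retry-and-reprompt loop described above could look roughly like this. It is a generic sketch and not Unsloth Studio's actual implementation: `call_model` is a hypothetical callable that takes a prompt string and returns the model's raw text, and the required-keys check is a stand-in for whatever schema validation the service really uses.

```python
import json


def get_structured_output(call_model, prompt, required_keys, max_attempts=3):
    """Ask for JSON; on failure, feed the error back to the model and retry."""
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("top-level JSON value must be an object")
            missing = [key for key in required_keys if key not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data, attempt
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            # Reprompt with the original request plus the concrete error.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {err}\n"
                "Reply with a single valid JSON object only."
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")
```

Returning the attempt count alongside the data makes it easy to track first-try accuracy separately from accuracy after retries.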

by redox9920 hours ago|

[-]

I agree that the web search tool probably is pretty bad. However a smart model would never hallucinate impossible weather data if the search tool failed.

I'm sure you can get something out of it if you babysit it with an optimized prompt, harness, etc and you can tolerate some failures. But when I try to run the ChatGPT prompts from my history, even if I pick the easier ones, it's hopeless.

I'd like to have a local agent on the phone with wikipedia level knowledge. But you probably need more like 30B params.

[-]

I use the 4B on my phone and it seems to work fine without tool calls. So it's definitely an issue with that and not the model itself. I'll play around and see if I can fix that, you might also try using the Searxng MCP as it's a better web search engine one.

by redox9920 hours ago|

[-]

I tried most prompts that didn't rely on recent knowledge on the basic "AI Chat", not the "Agent skills" version.

I just tested "List the 5 most recent Argentina vice presidents" on E4B and it literally got all 5 wrong

by satvikpendem19 hours ago|

[-]

I use it for recommendations, like recipes or basic stuff like that, rather than knowledge. Its answers are limited by its knowledge cutoff, so it's not necessarily accurate. But the agent skills section does have a query Wikipedia tool call.

Try this on Unsloth Studio, they seem to have fixed Gemma tool calling.

by redox9919 hours ago|

[-]

Argentina's vice presidents span from 2007 to 2023. Knowledge cutoff can't explain getting all 5 of them wrong.

by Melatonic18 hours ago|

[-]

What did it say were the presidents from those years?

by redox9917 hours ago|

[-]

It can answer presidents fine. It fails for vice presidents.

-----------

As of my last update, here are the five most recent individuals to have served as Vice President of Argentina:

Sergio Massa (Served as Vice President from 2019 to 2023)

Martín Lousteau (Served as Vice President from 2015 to 2019)

Cristina Fernández de Kirchner (Served as Vice President from 2007 to 2015)

Néstor Kirchner (Served as Vice President from 2003 to 2007)

Eduardo Duhalde (Served as Vice President from 1999 to 2003)

Note on the list: The term "most recent" can be interpreted in two ways:

Most recent to have served: This list follows that interpretation, showing the last five people who held the office.

Most recent current officeholders: If you are asking for the current Vice President, that position is currently held by Juan Manuel Moreno (who was appointed in 2024).

If you are looking for the current Vice President, please let me know!

by refulgentis1 days ago|

[-]

@google.com'ers, there are no GGUFs (blog says there is)

by minimaxir1 days ago|

[-]

Isn’t this it? https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf

by refulgentis1 days ago|

[-]

Ah, nice, ty! My excuse is those repos were added to the collection after my comment, but perhaps not :3

[-]

I don't get this obsession with smaller models. I've been using Claude and GPT models for years and have had zero issues with them.

I see absolutely no benefit to me as an end user for a local model which is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet and if I don't then not having access to an AI model is the least of my concerns.

by adam_arthur21 hours ago|

[-]

The entire universe of automation projects that can be run effectively for free relative to SoTA models?

I don't think many realize that most LLM embedded automation, pipelines, products will soon be able to run extremely cheaply on models < 100B parameters.

Frontier models will be used for coding/creation use cases, yes. But for all the pseudo-deterministic, pipeline, analysis style things there will be no practical benefit to running frontier models, only additional cost.

Gemma 4 26B outperforms most 100-200B models that I've tested for reasoning and structured output.

Gemma 4 12B can consistently select where to click on browser images given a minimal prompt, and do so very quickly.

by dofm20 hours ago|

[-]

The 26B model is really surprising, and it is impressively concise — it spends a lot less time dithering than Qwen3.6.

[-]

Practically if you're running a small personal automation project you're not going to want to waste a lot of time configuring and tuning a local model. You want to build the automation and move on.

If you're building an automation as a company you definitely won't want to take on the long term maintenance overhead of running your own models for some automation project.

by adam_arthur21 hours ago|

[-]

These small models exist in the cloud and are/will be priced commensurately to their size.

Your claim is effectively that companies don't care about operational/cloud costs. Even pre-LLM, companies regularly assessed and tried to pare down cloud spend.

by sowbug1 hours ago|

[-]

Whatever you're doing, try doing 500 or 1,000 of it in a batch. You'll exhaust any subscription quota you have, or if you're paying per token, you will probably find it too expensive. That's when you'll start to ask "how smart a model do I really need for this job?", and you'll investigate running a small but sufficiently capable model on your own PC, churning overnight through your 1,000 tasks.

by mikeocool21 hours ago|

[-]

> I've been using Claude and GPT models for years

All 3 years?

[-]

GPT1 was released in 2018, so yes, since then.

by victorbjorklund1 hours ago|

[-]

GPT1 was way worse than small Gemmas are now.

by Zambyte21 hours ago|

[-]

I like using my computer.

[-]

Exactly, thank you, we are on the same page! It's great to be able to use our own devices and not have their compute coopted by a third party.

I'd rather not have intensive compute needed shifted onto my personal machine which I want to use for something else.

[-]

By that logic, any software you run that isn't fully built by yourself is "third party" therefore you shouldn't run anything at all on your machine, thus obviating the need for it entirely.

by steno13220 hours ago|

[-]

But practically AI inference requires substantial local computing resources. It's not some web app, it's an order of magnitude more compute needed

by Zambyte20 hours ago|

[-]

Hopefully now you understand why people want smaller models.

[-]

Not really, I run a production service on a basic server using these Gemma models, the server is weaker than my MacBook. Most people's laptops and even phones actually can run local models, most simply don't know how. Run Unsloth Studio and you'll see how easy it is.

As the sibling says this is why people want smaller but still performant models.

by Zambyte21 hours ago|

[-]

I am not a "third party" on my own computer.

by user272221 hours ago|

[-]

There is tinfoil.sh as well but honestly running this stuff on an airgapped server allows a better peace of mind about the data being used for something else.

[-]

What's wrong with the data being used for something else? Someone is providing digital intelligence to us, saving us many hours a week, so the least we can do is provide them a little data so they are able to improve their service.

It would be selfish and unethical not to in my view. And ultimately the data is just being used in order to improve the models and benefit us, not for anything nefarious.

by NicuCalcea19 hours ago|

[-]

If sharing our data is the least we can do, they shouldn't also ask us for our money. Otherwise, it's more than the least.

by mannanj21 hours ago|

[-]

I don't like the gaslighting of paying Anthropic or Open(Closed)AI and being told it's unsustainable for them to take my payment, while they simultaneously take my data (edit: which is incredibly valuable) and I cannot opt out of that.

The obsession is for leaving hostile and abusive entities, the corporations or the people who fund them that have a horrible track record in regards to ethicality, rights and respect & human dignity.

[-]

My view is, if you're going to use the service - you should give the data.

It's like using Gmail and expecting them not to train their AI models on your data - how can you expect that when they're giving you a secure, reliable, highly functional email client completely for free?

The digital economy only works if everyone pays their fair share. If you don't want to give your data then you are really harming everyone by slowing down AI development for everyone else.

by klardotsh20 hours ago|

[-]

Because we pay for the models.

If I pay you for a service, what implicit right should you have to then continue to profit in perpetuity by storing the data I paid you to process?

If LLMs were free your Gmail analogy might hold up. They aren’t, and so it doesn’t.

AI development can continue with the data folks opt into, or with the data AI companies incessantly scrape with reckless disregard for polite system loads. AI development does not require retaining all user inputs forever.

by mannanj19 hours ago|

[-]

However, you didn't actually get what I meant down, so you ended up inadvertently Straw Manning me.

My disinterest is in sharing my intellectual IP. Most people up to now, have never shared this much of their intellectual IP with a company. Name one product through human history before that got this much data and insight into human thinking and now can use your most intimate conversations, ideas and needs for non-training purposes?

You can't even opt out of that! At least for the training data you can opt-out.

by satvikpendem12 hours ago|

[-]

Intellectual "property" is not real property. While I disagree with the parent on many things as my comments show, IP is not one of them. Information should be free, for anyone and everyone.

by mannanj3 hours ago|

[-]

Another straw man.

"real" property or not. You agree that we have some right to our own outputs, right? Is that not dignity, to say "I want my outputs protected".

Seems like you think that your ideas should be free, as you called it information. How about you back that up with action... please send me all your most intimate, valuable ideas. Oh no, you don't feel comfortable? Then why are you sharing it with companies?

by mannanj19 hours ago|

[-]

Apple is a good example of ethical services. They still give you privacy and ownership of your data, you keep your dignity and data. Google is a horrible model for this - it matches the whole thing about unethical, abusive, gaslighting relationships I described.

by satvikpendem12 hours ago|

[-]

The same Apple that takes 30% of all transactions? In reality no corporation is in the public interest.

by mannanj3 hours ago|