undefined

upvote

points

by aliljet16 hours ago |

upvote

by oompydoompy7415 hours ago|

[-]

Idk about everyone else, but I don’t want to rent tokens forever. I want a self hosted model that is completely private and can’t be monitored or adulterated without me knowing. I use both currently, but I am excited at the prospect of maybe not having to in the near to mid future.

I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.

reply

upvote

by danny_codes4 hours ago|

[-]

Exactly. Relying on external compute for professional work is a non-starter IMO.

reply

upvote

by seemaze15 hours ago|

[-]

Qwen3.5-9B has been extremely useful for local fuzzy table extraction OCR for data that cannot be sent to the cloud.

The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.

Now with the multi-modal capabilities of these models we can leverage the language capabilities along side vision to extract structured data from a table that has 'roughly this shape' and 'this location'.

reply

upvote

by marssaxman16 hours ago|

[-]

I used vLLM and qwen3-coder-next to batch-process a couple million documents recently. No token quota, no rate limits, just 100% GPU utilization until the job was done.

reply

upvote

by znnajdla15 hours ago|

[-]

Some tasks don’t require SOTA models. For translating small texts I use Gemma 4 on my iPhone because it’s faster and better than Apple Translate or Google Translate and works offline. Also if you can break down certain tasks like JSON healing into small focused coding tasks then local models are useful

reply

upvote

by oktoberpaard15 minutes ago|

[-]

Do you use E2B or E4B?

reply

upvote

by kaliqt13 hours ago|

[-]

Is it really better? In which languages?

reply

upvote

by deaux12 hours ago|

[-]

Yes it is and has been for a very long time, it has been years now. Gemini 1.5 Pro is when LLM translations started significantly outperforming non-LLM machine translation, and that came out over 2 years ago.

Ever since then Google models have been the strongest at translation across the board, so it's no surprise Gemma 4 does well. Gemini 3 Flash is better at translation than any Claude or GPT model. OpenAI models have always been weakest at it, continuing to this day. It's quite interesting how these characteristics have stayed stable over time and many model versions.

I'm primarily talking about non-trivial language pairs, something like English<>Spanish is so "easy" now it's hard to distinguish the strong models.

reply

upvote

by znnajdla1 hours ago|

[-]

I translate texts between Ukrainian, Russian and English dozens of times daily. The LLM translation is not only better, it's also refineable, you can chat with the AI to make changes to what you meant.

reply

upvote

by homebrewer11 hours ago|

[-]

I've been using gemma4 for translating Mongolian to English. It runs circles around Google Translate for that language pair, it's not even close.

reply

upvote

by mistercheese2 hours ago|

[-]

I use local models for asking about personal financial or health data that I want to keep local and private. Or even just whipping up quick and dirty prototypes for whatever I can think of but not seriously enough to spend tokens that I rather use on real projects.

reply

upvote

by ThatPlayer7 hours ago|

[-]

I'm using the smaller vision models (Qwen3.5-4B currently) with Frigate, a FOSS self-hosted "AI" NVR. It's good enough at analyzing images to figure out mostly what's happening, and doesn't require the big knowledge base that bigger models have.

Also use a bigger model for summarizing or translating text, which I don't consume in realtime, so doesn't need to be fast. Would be a thing I could use OpenAI's batch APIs for if I did need something higher quality.

reply

upvote

by lkjdsklf16 hours ago|

[-]

The people i know that use local models just end up with both.

The local models don’t really compete with the flagship labs for most tasks

But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)

reply

upvote

by deaux12 hours ago|

[-]

The open weights models absolutely compete with flagship labs for most tasks. OpenAI and Anthropic's "cheap tier" models are completely uncompetitive with them for "quality / $" and it's not close. Google is the only one who has remained competitive in the <$5/1M output tier with Flash, and now has an incredibly strong release with Gemma 4.

Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.

reply

upvote

by kamranjon15 hours ago|

[-]

I use LMStudio to host and run GLM 4.7 Flash as a coding agent. I use it with the Pi coding agent, but also use it with the Zed editor agent integrations. I've used the Qwen models in the past, but have consistently come back to GLM 4.7 because of its capabilities. I often use Qwen or Gemma models for their vision capabilities. For example, I often will finish ML training runs, take a photo of the graphs and visualizations of the run metrics and ask the model to tell me things I might look at tweaking to improve subsequent training runs. Qwen 3.5 0.8b is pretty awesome for really small and quick vision tasks like "Give me a JSON representation of the cards on this page".

reply

upvote

by bildung15 hours ago|

[-]

The privacy/data security angle really is important in some regions and industries. Think European privacy laws or customers demanding NDAs. The value of Anthropic and OpenAI is zero for both cases, so easy to beat, despite local models being dumber and slower.

reply

upvote

by Aurornis15 hours ago|

[-]

It’s easy to find a combination of llama.cpp and a coding tool like OpenCode for these. Asking an LLM for help setting it up can work well if you don’t want to find a guide yourself.

> and finding more value than just renting tokens from Anthropic of OpenAI?

Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.

reply

upvote

by toxik13 hours ago|

[-]

Oh lord, are the LLMs already replacing LLMs?

reply

upvote

by jwitthuhn12 hours ago|

[-]

I've been largely using Qwen3.5-122b at 6 bit quant locally for some c++/go/python dev lately because it is quite capable as long as I can give it pretty specific asks within the codebase and it will produce code that needs minimal massaging to fit into the project.

I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.

reply

upvote

by 38362936489 hours ago|

[-]

How much VRAM do you need for that?

reply

upvote

by seemaze8 hours ago|

[-]

I squeeze Qwen3.5-122B-A10B at Q6 into 128GB. It's a great model.

reply

upvote

by mistercheese2 hours ago|

[-]

Wow what kind of hardware do you have? Mac Studio, dgx spark, strix halo? How fast is it?

reply

upvote

by deaux14 hours ago|

[-]

While they can be run locally, and most of the discussion on HN about that, I bet that if you look at total tok/day local usage is a tiny amount compared to total cloud inference even for these models. Most people who do use them locally just do a prompt every now and then.

reply

upvote

by zozbot23414 hours ago|

[-]

This is why I'd like to see a lot more focus on batched inference with lower-end hardware. If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so, you don't really need top-of-the-line hardware even for SOTA results.

reply

upvote

by mistercheese2 hours ago|

[-]

That’s a good point. I think I saw Together.ai with that offering, but for some reason just never think to throw random non urgent coding tasks at it overnight

reply

upvote

by deaux12 hours ago|

[-]

> If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so

But they can't? The usage pattern is the polar opposite. Most people running these models locally just ask a few questions to it throughout the day. They want the answers now, or at least within a minute.

reply

upvote

by zozbot23412 hours ago|

[-]

If you want the answer right now, that alone ups your compute needs to the point where you're probably better off just using a free hosted-AI service. Unless the prompt is trivial enough that it can be answered quickly by a tiny local model.

reply

upvote

by flux312516 hours ago|

[-]

They are okay for vibe coding throw-away projects without spending your Anthrophic/OAI tokens

reply

upvote

by zackify7 hours ago|

[-]

always inside claude code, just using ollama, takes 2 seconds

reply

upvote

by Panda416 hours ago|

[-]

I was thinking the same thing. My only guess is that they are excited about local models because they can run it cheaper through Open Router ?

reply

upvote

by kylehotchkiss13 hours ago|

[-]

I am working on a research project to link churches from their IRS Exempt org BMF entry to their google search result from 10 fetched. Gwen2.5-14b on a 16gb Mac Mini. It works good enough!

It's entertaining to see HN increasingly consider coding harness as the only value a model can provide.

reply

upvote

by dist-epoch14 hours ago|

[-]

There are really nice GUIs for LLMs - CherryStudio for example, can be used with local or cloud models.

There are also web-UIs - just like the labs ones.

And you can connect coding agents like Codex, Copilot or Pi to local coding agents - the support OpenAI compatible APIs.

It's literally a terminal command to start serving the model locally and you can connect various things to it, like Codex.

reply

upvote

by ssrshh5 hours ago|

[-]

[dead]

reply