undefined

points

[-]

I have vLLM running on a Linux machine in my basement, connected with Tailscale, and I use small models as part of tasks like this:

- Transcribing scanned documents into formatted text

- Captioning/describing images and classifying them for audience suitability (includes anti-spam)

- Matching documents with relevant Wikipedia pages for tagging

I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.

I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.

by robgough1 days ago|

prev|

[-]

I've got a home-built dictation app that uses a local model to clear up the text and fix grammar. It was super easy to build. I’m extending it to capture meeting notes and summarise too. All on-device.

I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.

There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.

by fittingopposite13 hours ago|

parent|

[-]

That's a great user case. Am sorry using parakeet but sometimes it garbles up things. Can you open source it?

by robgough3 hours ago|

parent|

[-]

Mac only I'm afraid, but I already did [1]. Packaged it too as I figured it might be useful to others [2] (and I'd want to install it on machine's that I might not have Xcode on)

[1]: https://github.com/robgough/dictator [2]: https://dictator.robgough.net

I spent the best part of a couple weeks making improvements and tidying up the UI, but to actually get something working was essentially only a couple of prompts.

by lnenad7 hours ago|

parent|

prev|

[-]

Handy is open source and works flawlessly for me with Parakeet v3.

by properbrew1 days ago|

prev|

[-]

I think small models have a very good niche for specific tasks. I utilise a fine tuned Phi-4 model (smaller than this one) that fits in about 3.5gb of RAM (not vram) for the document processing side of things for the desktop app I develop (a bit of a shameless plug - whistle-enterprise.com).

If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.

by SeriousM1 days ago|

parent|

[-]

Thank you for sharing your usecase! I like your product very much!

Could you talk a bit how you did the finetuning? Did you use unsloth or any other tool and how went the verification to proof the outcome?

by properbrew22 hours ago|

parent|

[-]

Thank you!

Yea absolutely, but man, where to even start, it is very specific.

Fundementally I didn't use any wrappers like unsloth or axolotl, although I have used the latter before a year or two back and it was good, but I needed something very very custom. I also wanted the whole fine tuning pipeline to exported OpenVino model to be seamless.

I heavily leaned on codex, claude and some manual sleuthing around the internet to understand what I needed. I'd played about with QLoRA finetuning with axolotl before and felt most comfortable with that. So I needed to keep everything as stripped down as possible and figured I can just utilise the 3 main huggingface libraries (transformers, peft and datasets) and also bitsandbytes (as suggested by claude to quantize the model to keep this working on my GPU) along with some custom scripts generated by claude/codex (each cross referencing each other) that will do the different stages of the training run.

The next part was the data. Obviously didn't have access to thousands of meetings and associated output documents but I did have a 3090ti sitting there and a codex subscription. So I set about working out what format I needed the data in (many thanks again, to claude/codex) and started generating hundreds of different transcripts, different amounts of speakers, content, tones, subjects, spelling mistakes - like all the different things you could think a meeting would have. Then it's a case of actually generating a good meeting document off the back of the transcripts and creating the "gold standard" that we'd use.

I'm going to gloss over a lot here as I'd rather not detail it as it relates to some propriatary stuff that I had to work through, but you basically pair the transcripts together and run the training.

At the verification stage, there was pretty much 3 things:

1. "just" do some regex string matching to see if there's any of the source transcript key facts in the output to ensure fact preservation. Same with owner fabrication (who said what), I don't want something attributed to someone when it wasn't them that said it and then finally markdown validation.

2. Using codex/claude to validate the transcript and output from the model - I used the latest frontier models, probably overkill for my task, but they were good at the job

3. Finally me going through some actual recordings of myself, groups, meetings and manually verifiying the output

So a fair bit of work, and for context I'm on version 10 now, so it's been a journey!

by quickthoughts1 days ago|

prev|

[-]

I use small models like Gemma to improve transcriptions from ASR models amongst other micro-tasks. I actually built out a fine-tuning whisper pipeline with all local (smaller) models meaning no cloud/big-tech co is able to train/sell my (private) data.

Repo is https://github.com/Rebreda/listenr - mainly geared toward Whisper fine-tuning, AMD hardware and local inference

by thot_experiment23 hours ago|

prev|

[-]

I don't know about this model, but the next one up, the 31B I've been using as an agentic coding assistant in OpenCode, and basically anything that's easy enough that I'd trust Sonnet to handle, I trust Gemma 4 to handle and it's been doing a great job, it surprises me positively much more often than negatively. I not infrequently run into situations where Gemma 4 fails to do the task and I switch to Opus 4.7 and it fails also.

by mhitza1 days ago|

prev|

[-]

In theory, locally you'd use these where lossiness is acceptable for audio transcription and image labeling (as simple examples).

In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.

by Aachen1 days ago|

prev|

[-]

"Small" models are the ones I can run myself on my own terms. LLMs aren't useful enough for me to justify spending hundreds of euros on a GPU with 16GB VRAM or something, and that's assuming I have the rest of the desktop just laying around. Back when I checked (before the RAM price hike), these models weren't meaningfully better than 4-8GB ones anyway, you'd have to go for the top tier cards at 24 or 32 GB iirc to get something vaguely in the direction of the SaaS versions, and that was absolutely out of my budget. Even if that changed, so have hardware prices so it'd probably still work out the same

by OtherShrezzing23 hours ago|

prev|

[-]

I use them for research on new features. If my feature is going to interact with a frontier language model in prod, I start with these free local ones which are all competent enough to produce structured output, make tool calls, interact with mcp etc. I don’t care much for the content at the early phase of engineering, I care about the schema & failure modes.

Then when I’m getting close to feature-complete, I’ll move to a hosted frontier model for the final integration.

Cost savings are enormous if you’re making dozens of calls to language models a minute.

by SwellJoe21 hours ago|

prev|

[-]

I've used Gemma for reviewing and categorizing my writing online over several years (~5 million words across a forum for an OSS project I work on, HN, reddit, etc.), experimenting with training LoRAs (again, on my own writing, since I don't have to worry about ethically sourcing the data if it's all mine), and I'm currently using it to perform web searches and extract data about a specific type of business. It's plenty smart to use a web search MCP to find all the businesses of the right type in a given city, read their website, extract business address, phone number, etc. among other things, and de-dupe and cross-check other sources.

I found Gemma 4 to be better, or at least more nuanced, than Gemini 2.5 Flash. And, the new Gemini 3.5 Flash is very good but is unrealistically expensive (ten times more expensive than DeepSeek or MiMo). So, since I don't need extremely fast performance, a self-hosted Gemma 4 wins for a bunch of stuff.

I've also found Qwen 3.6 27B to be shockingly good at finding security bugs for its size. It beats several larger models, and is close to Gemini Pro 3.1 (but Gemini 3.5 Flash surprisingly beats it soundly). Since it only costs electricity, and my electricity is cheap and 100% renewable, I can use it more broadly than I might otherwise use a hosted model.

All that said, the smart money is still on buying the subsidized tokens from the providers that offer them, rather than buying the hardware needed to run models that are 30+GB in size, as all of the ones I've been using regularly are (8-bit quantization, as they get a little dumber for every bit you drop below that). A $100 subscription to Claude or Codex currently provides access to the best models at a heavily discounted rate. And, DeepSeek/MiMo are extremely cheap, one or more orders of magnitude cheaper than the top models from Anthropic or OpenAI, if you need an API for automated usage. I spent about $4000 on my two inference machines (a Strix Halo with 128GB unified RAM, and a new desktop build based around two cheap old 32GB AMD data center GPUs), which buys a lot of tokens for tiny models like this...probably a couple/few years worth. But, I like tinkering, so having an excuse to play with hardware is its own reward. If it happens to pay me back some of that money, that's a bonus.

Of course, as the major providers decide they need to ring the cash register and stop burning money on subsidized tokens, that math may change, and I may find I'm grateful to have already bought this stuff before the RAM prices made everything 2-3x more expensive.

But, I think if you're not interested in learning about the technology and doing your own training experiments and such, you should probably not try to run stuff locally most of the time.

by ai_fry_ur_brain20 hours ago|

parent|

[-]

So one of thr things you're using it for is to generate leads to spam businesses with unwanted LLM produced marketing materials it sounds like.

Wow LLMs are changing the world, what a utopia.

by SwellJoe20 hours ago|

parent|

[-]

> So one of thr things you're using it for is to generate leads to spam businesses with unwanted LLM produced marketing materials it sounds like.

You don't know me. And, no.

by pilooch20 hours ago|

prev|

[-]

Yes, all my emails gyer sorted out by a finetuned gemma. There are turned into images passes to the model, as multimodal is so practical.

by Xiol1 days ago|

prev|

[-]

I've yet to see someone answer a question like this with a decent, useful answer.

by sureglymop18 hours ago|

prev|

[-]

I moreso run other small special purpose models like Whisper, SAM, Matcha, CLIP etc. and then do contextual correction passes with models like this.

Think almost like unix pipelines, have used it for many workflows.

by airstrike1 days ago|

prev|

[-]

This is one https://post.bot/

by bensyverson23 hours ago|

parent|

[-]

What model is it using?

by airstrike21 hours ago|

parent|

[-]

I do not know which model specifically, but I saw the founder answering a question about how it's a small model that's focused on just this one specific requirement.

I expect it to be something like https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF

by ai_fry_ur_brain20 hours ago|

parent|

prev|

[-]

Why would I want an AI receptionist. A human receptionist is about 1000x more careful, caring and intentional.

They are charging $15.00 an hour for an llm powered assistant. Like wtf, how do these people think that's a valid business model. This will 1000% annoy every customer that uses it. I hate this timeline so much.

by airstrike20 hours ago|

parent|

[-]

No, this is a phone service. They charge $0.25 per minute on the phone on a call that would otherwise not connect.

Can you call a receptionist at 10pm and book an appointment? Or ask for directions? What if it's 10am and she's already on the line with someone else and you just want to ask if there's parking?

by ai_fry_ur_brain18 hours ago|

parent|

[-]

Please tell me what 0.25c x 60 is.

Yes, they're called after hours answering services and they're exponentially better because I get to talk to a human.

If my doctors office replaced a receptionist with this I would switch and leave bad reviews across every platform possible.

Ive already switched doctors once because they used an LLM transcription service during my appoitment that influenced the doctors recommendations for care. Sorry technology does not belong everywhere.

AI produces low quality work and will turn your business to shit.

by gnabgib14 hours ago|

parent|

[-]

Are you, perhaps, missing that $0.25/minute is only minutes on call? An agent not answering the phone for an hour is $0 (not $15).. for after-hours calls (rare) this is a meagre rate, compared to pay-per-hour (no matter the call volume) answering services.