It all depends on how cheap they can get. And another interesting thought: what if you could stack them? For example, you start with a base model module, then new ones come out that work together with the old ones and expand their capabilities.
reply
New GPUs come out all the time. New phones come out all the time (if you count all the manufacturers). We don't always need to buy the newest one.

Current open-weight models under 20B are already capable of being useful. At even 1K tokens/second, they would change what it means to interact with them, or for models to interact with the computer.

reply
Hm, yeah, I guess if they stick to shitty models it works out. I was talking about the models people use to actually do things, not shitposting from openclaw and getting reminders about their next dentist appointment.
reply
The trick with small models is what you ask them to do. I am working on a data extraction app (from emails and files) that runs entirely locally. I applied for the Taalas API because it would be an awesome fit.

dwata: Entirely Local Financial Data Extraction from Emails Using Ministral 3 3B with Ollama: https://youtu.be/LVT-jYlvM18

https://github.com/brainless/dwata
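
For a sense of what that kind of constrained extraction can look like, here is a minimal sketch against Ollama's local REST API. The model tag and the output schema are illustrative assumptions, not dwata's actual implementation:

    import json
    import requests

    # Illustrative email body to extract from.
    EMAIL = """Subject: Your invoice #4521
    Total due: $89.50 by March 3rd. Thanks, Acme Hosting."""

    prompt = (
        "Extract the vendor, invoice number, amount, and due date from "
        "this email. Reply with JSON only, using the keys vendor, "
        f"invoice_number, amount, due_date.\n\nEmail:\n{EMAIL}"
    )

    resp = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        json={
            "model": "ministral-3b",  # assumed tag; use whatever model you pulled
            "prompt": prompt,
            "stream": False,
            "format": "json",  # constrain the output to valid JSON
        },
        timeout=120,
    )
    print(json.loads(resp.json()["response"]))

Pinning the model to a fixed output schema like this is a big part of why small models hold up: they are filling in fields, not planning.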

reply
Considering that enamel regrowth is still experimental (only Curodont exists as a commercial product), those dentist appointments are probably the most important routine healthcare appointments in your life. Pick something that is actually useless.
reply
If you need a full-blown LLM with root access to all your devices to remind you about an appointment, something is very wrong with your life.
reply
To run Llama 3.1 8B locally, you would need a GPU with a minimum of 16 GB of VRAM, such as an NVIDIA RTX 3090.

Taalas promises 10x higher throughput at 10x lower cost and 10x less electricity.

Looks like a good value proposition.

reply
> To run Llama 3.1 8B locally, you would need a GPU with a minimum of 16 GB of VRAM, such as an NVIDIA RTX 3090

In full precision, yes. But this Taalas chip uses a heavily quantized version (the article calls it a "3/6 bit quant", probably similar to Q4_K_M). You don't even need a GPU to run that with reasonable performance; a CPU is fine.
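
The arithmetic behind that is simple. A rough weight-only estimate for an 8B model at different precisions (KV cache and runtime overhead come on top, so treat these as lower bounds):

    # Rough weight-only memory for an 8B-parameter model.
    # Bits-per-weight figures for the llama.cpp quants are approximate;
    # the "3/6-bit mix" average is an assumption about Taalas's scheme.
    PARAMS = 8e9

    def weight_gb(bits_per_weight: float) -> float:
        return PARAMS * bits_per_weight / 8 / 1e9

    for name, bits in [
        ("fp16", 16),          # ~16 GB: hence the "16 GB of VRAM" figure
        ("Q8_0", 8.5),         # ~8.5 GB
        ("Q4_K_M", 4.85),      # ~4.9 GB
        ("3/6-bit mix", 4.5),  # ~4.5 GB (assumed average)
    ]:
        print(f"{name:>12}: ~{weight_gb(bits):.1f} GB")

At 4-5 GB of weights, the quantized model fits comfortably in ordinary system RAM, which is exactly why CPU inference is viable.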

reply
What do you do with 8B models? They can't even reliably create a .txt file or do any kind of tool calling.
reply
Exploration, summarization, classification, translation
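
Classification in particular works because the answer set is closed. A minimal sketch with llama-cpp-python running a quantized model on CPU; the GGUF filename is an assumption, point it at whatever small model you have:

    # Sentiment classification with a small quantized model on CPU.
    from llama_cpp import Llama

    llm = Llama(model_path="llama-3.1-8b-instruct.Q4_K_M.gguf",
                n_ctx=512, verbose=False)

    def classify(text: str) -> str:
        out = llm(
            "Classify the sentiment of this review as exactly one word, "
            f"positive or negative.\nReview: {text}\nSentiment:",
            max_tokens=3, temperature=0,
        )
        return out["choices"][0]["text"].strip().lower()

    print(classify("The battery died after two days."))  # -> negative

A closed answer set sidesteps the reliability problems with open-ended generation and tool calling.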
reply
Re-read Brave New World. Deltas and Epsilons have their place, even if Alphas and Betas got smarter overnight.

Roof! Roof!

reply
You obviously don't believe that AGI is coming within two release cycles, and you also don't seem to have much faith in the new models containing massive improvements over the last ones. So the answer to "who is going to pay for these custom chips?" seems to be: you.
reply
Why would I buy chips to run handicapped models when the 10+ LLM players all offer free-tier access to their 1T+ parameter models?
reply
Do you think the free gravy train will run forever?
reply
Not all applications are chatbots. Many potential uses for LLMs/VLMs are latency-constrained.
reply
I'm guessing this development will make the fabrication of custom chips cheaper.

Exciting times.

reply
Probably the datacenters that serve those models?
reply
Almost all LLM companies have some sort of free tier that does nothing but lose them money.
reply