Oh dear, 14B and a 4-bit quant? There are going to be a lot of embarrassed programmers who need to explain to their engineering managers why their MacBook can't reasonably run LLMs like they said it could. (This already happened at my Fortune 20 company lol)
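For anyone wanting to sanity-check the "14B at 4-bit" number, here's a rough back-of-the-envelope sketch. It only counts the weights (parameters times bits per weight); KV cache, activations, and runtime overhead come on top and depend on the runtime and context length, so treat it as a lower bound. The function name is just something I made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes.

    Ignores KV cache, activations, and any per-block quantization metadata,
    so the real memory requirement is higher than this.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 14B parameters at 4 bits per weight vs. the same model at 16 bits.
    print(f"14B @ 4-bit : ~{weight_footprint_gb(14e9, 4):.1f} GB of weights")
    print(f"14B @ 16-bit: ~{weight_footprint_gb(14e9, 16):.1f} GB of weights")
```

Whether that fits comfortably depends on how much unified memory the machine has and what else is running, which is exactly the conversation with the manager.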
Voice -> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> text to speech.
TTFT is irrelevant here: you have to process everything through the whole pipeline before you can generate a response. A fast model is more important than a good model.
Source: I do this kind of stuff for call centers. Yes, I know modern LLMs don’t have to go through the voice -> text -> LLM -> text -> voice chain anymore, but that only works when you don’t have to call external sources.
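To make the pipeline argument concrete, here's a minimal sketch, assuming every stage has to finish completely before the next one starts. The stage names follow the chain above; all the timings are invented for illustration and are not measurements of any real system.

```python
# Stages of the call-center pipeline, in the order they run.
PIPELINE = [
    "speech_to_text",
    "llm_intent",      # the full JSON must exist before the API call can be made
    "api_call",
    "llm_response",
    "text_to_speech",
]


def end_to_end_seconds(stage_seconds: dict) -> float:
    """Stages run strictly in sequence, so the caller waits for their sum."""
    return sum(stage_seconds[name] for name in PIPELINE)


# Hypothetical per-stage latencies in seconds (made-up numbers).
fast_model = {
    "speech_to_text": 0.3,
    "llm_intent": 0.4,
    "api_call": 0.2,
    "llm_response": 0.6,
    "text_to_speech": 0.3,
}
# Same pipeline, but with a bigger, slower model in both LLM stages.
slow_model = {**fast_model, "llm_intent": 1.5, "llm_response": 2.5}

if __name__ == "__main__":
    print(f"fast model: {end_to_end_seconds(fast_model):.1f}s of silence on the line")
    print(f"slow model: {end_to_end_seconds(slow_model):.1f}s of silence on the line")
```

The LLM shows up twice in the chain, so any slowdown in full-output generation is paid twice per turn, and a quick first token doesn't help because the downstream stage needs the complete output.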
Generally, the fewer parameters a model has, the less knowledge it has.
And not even diehard Apple fanboys deny this.
I genuinely feel bad for people who fall for their marketing thinking they will run LLMs. Oh well, I got scammed on RuneScape as a child when someone said they could trim my armor... Everyone needs to learn.
There definitely are some who fit into this category, but if they're buying the latest and greatest on a whim then they've likely got money to burn and you probably don't need to feel bad for them.
Reminds me of the saying: "A fool and his money are soon parted".
That's how they make loot on their 128GB MacBook Pros. By kneecapping the cheap stuff. Don't think for a second that the specs weren't chosen so that professional developers would have to shell out the 8 grand for the legit machine. They're only gonna let us do the bare minimum on a MacBook Air.
Latency to the first token is not like a web page, where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput).
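A quick sketch of that trade-off, assuming the first token arrives after the TTFT and every later token arrives at a constant decode rate. The 50ms and 200ms figures are the ones from the comment; the throughput numbers and the response length are hypothetical, picked only to show the effect.

```python
def time_to_full_response(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Seconds until the last of n_tokens has arrived.

    The first token lands after ttft_s; the remaining n_tokens - 1 tokens
    are assumed to stream at a constant tokens_per_s.
    """
    if n_tokens <= 0:
        return 0.0
    return ttft_s + (n_tokens - 1) / tokens_per_s


if __name__ == "__main__":
    n = 101  # hypothetical response length in tokens
    quick_start = time_to_full_response(ttft_s=0.05, tokens_per_s=10, n_tokens=n)
    quick_finish = time_to_full_response(ttft_s=0.20, tokens_per_s=40, n_tokens=n)
    print(f"50ms TTFT, 10 tok/s : {quick_start:.2f}s for the whole answer")
    print(f"200ms TTFT, 40 tok/s: {quick_finish:.2f}s for the whole answer")
```

With numbers like these, the 150ms saved on the first token is noise next to the seconds spent waiting for the rest of the sentence.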