upvote
>Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization

Oh dear, 14B and 4-bit quant? There are going to be a lot of embarrassed programmers who need to explain to their engineering managers why their MacBook can't reasonably run LLMs like they said it could. (This already happened at my Fortune 20 company lol)

reply
I wonder if Apple has foresight into locally running LLMs becoming sufficiently useful.
reply
It won’t handle serious tasks, but I have Gemma 3 installed on my M2 Mac and it is good for most of my needs, especially data I don’t want a corporation getting its hands on.
reply
They do! "You're holding it wrong."
reply
That is talking about battery life, not AI tasks. See footnote 53, where it says, "Up to 18 hours battery life":

https://www.apple.com/macbook-pro/

reply
Quite interesting that it's now a selling point just like fps in Crysis was a long time ago.
reply
Next is the fps of an AI playing Crysis.
reply
Or tasks per minute of the AI doing your job for you
reply
That measurement will be AI assembling MacBook Pros vs. human assemblers: number of units per hour, day, or whatever unit is most applicable.
reply
[flagged]
reply
So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.
reply
For many workflows involving real-time human interaction, such as a voice assistant, this is the most important metric. Once a certain response-quality threshold has been reached, very few tasks are as sensitive to quality as the software planning and writing tasks most HN readers are familiar with.
reply
The way that voice assistants work even in the age of LLMs is

Voice -> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> Text to Speech.

TTFT is irrelevant; you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model.

Source: I do this kind of stuff for call centers. Yes, I know modern LLMs don’t have to go through the voice -> text -> LLM -> text -> voice pipeline anymore, but that only works when you don’t have to call external sources.
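The pipeline above can be sketched roughly like this. Every stage is a stub with made-up function names and data (nothing here is a real ASR/LLM/TTS API); the point is just the sequencing, and why the whole chain has to finish before any audio comes back:

```python
import json

# Hypothetical stubs for each stage of the classic voice-assistant pipeline.
# A real system would call an ASR engine, an LLM endpoint, a backend API,
# and a TTS engine; here each step returns canned data.

def speech_to_text(audio: bytes) -> str:
    return "what's my account balance"  # stub ASR result

def llm_extract_intent(utterance: str) -> str:
    # A real system would prompt an LLM to emit structured JSON.
    return json.dumps({"intent": "get_balance", "account": "checking"})

def call_backend_api(intent: dict) -> dict:
    return {"balance": 1234.56}  # stub external API call

def llm_compose_reply(api_response: dict) -> str:
    return f"Your balance is ${api_response['balance']:.2f}."

def text_to_speech(text: str) -> bytes:
    return text.encode()  # stub TTS

def handle_turn(audio: bytes) -> bytes:
    utterance = speech_to_text(audio)
    intent = json.loads(llm_extract_intent(utterance))
    api_response = call_backend_api(intent)
    reply = llm_compose_reply(api_response)
    return text_to_speech(reply)
```

Note that the user hears nothing until `handle_turn` returns, which is why end-to-end pipeline latency, not TTFT of any single model, is what matters here.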

reply
Topical. My hobby project this week (0) has been hyper-optimizing microgpt for M5's CPU cores (and comparing to MLX performance). Wonder if anything changes under the regime I've been chasing with these new chips.

0: https://entrpi.github.io/eemicrogpt/

reply
Consider using FP16 or BF16 for the matrix math (with SME you can use svmopa_za16_f16_m or svmopa_za16_bf16_m).
reply
14-billion parameter model with 4-bit quantization seems rather small
reply
I think these aren't meant to be representative of arbitrary userland-workload LLM inferences, but rather the kinds of tasks macOS might spin up a background LLM inference for. Like the Apple Intelligence stuff, or Photos auto-tagging, etc. You wouldn't want the OS to ever be spinning up a model that uses 98% of RAM, so Apple probably considers themselves to have at most 50% of RAM as working headroom for any such workloads.
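As a rough sanity check of that sizing (my own back-of-envelope assumptions, not anything Apple has published; this ignores KV cache and activation overhead):

```python
# Back-of-envelope memory math for a 14B-parameter model at 4-bit
# quantization, checked against a hypothetical "50% of RAM" OS headroom.
# Assumption: weight storage dominates; KV cache/activations are ignored.

def model_weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

weights_gb = model_weight_gb(14, 4)   # 14B params at 4 bits ~= 7 GB
headroom_gb = 0.5 * 24                # half of a hypothetical 24 GB machine
fits = weights_gb < headroom_gb       # comfortably within that budget
```

So a 14B/4-bit model is about the largest thing that leaves comfortable slack on mid-range configs, which is consistent with it being a background-workload ceiling rather than a brag.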
reply
It's not much for a frontier AI but it can be a very useful specialized LLM.
reply
On my 24GB RAM M4 Pro MBP, some models run very quickly through LM Studio to Zed; I was able to ask it to write some code. Of course my fan starts spinning like the world's ending, but it's still impressive what I can do 100% locally. I can't imagine what it's like on a more serious setup like the Mac Studio.
reply
How is the output quality of the smaller models?
reply
Not good enough for coding anything more than simple scripts.

Generally, the fewer parameters, the less knowledge they have.

reply
For anyone who has been watching Apple since the iPod commercials: Apple has long operated in a grey area when it comes to the honesty of its marketing.

And not even diehard Apple fanboys deny this.

I genuinely feel bad for people who fall for their marketing thinking they will run LLMs. Oh well, I got scammed on RuneScape as a child when someone said they could trim my armor... Everyone needs to learn.

reply
In retrospect, was there a better place to learn about the cruelty of the world than RuneScape? I must've gotten scammed thrice before I lost the youthful light in my eyes.
reply
Yesterday I ran qwen3.5:27b on an M1 Max with 64 GB of RAM. I have even run Llama 70B back when llama.cpp came out. These run sufficiently well, if somewhat slowly, but given the improvements in the M5 Max it should be a much faster experience.
reply
I don't know that there would be a huge overlap between the people who would fall for this type of marketing and the people who want to run LLMs locally.

There definitely are some who fit into this category, but if they're buying the latest and greatest on a whim then they've likely got money to burn and you probably don't need to feel bad for them.

Reminds me of the saying: "A fool and his money are soon parted".

reply
There used to be a polite way to call this out: the "Steve Jobs reality distortion field".
reply
Now that every CEO has their own reality distortion field I wonder if it's even worth calling out any more.
reply
Most are not nearly as smooth and successful at the distorting.
reply
It is.

That's how they make loot on their 128GB MacBook Pros. By kneecapping the cheap stuff. Don't think for a second that the specs weren't chosen so that professional developers would have to shell out the 8 grand for the legit machine. They're only gonna let us do the bare minimum on a MacBook Air.

reply
Does that include loading the model again? Apple seems to be the only company doing such shenanigans in their measurements.
reply
Seems very reasonable to me
reply
A bit strange to use time to first token instead of throughput.

Latency to the first token is not like a web page where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput)
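A rough way to see it: perceived total latency is TTFT plus output length divided by throughput, so for long completions throughput dominates. A toy calculation with made-up numbers (nothing here is a measured figure):

```python
# Total perceived latency = time to first token + decode time.
# All numbers below are illustrative, not benchmarks.

def total_latency_s(ttft_s: float, out_tokens: int, tok_per_s: float) -> float:
    return ttft_s + out_tokens / tok_per_s

# For a 500-token answer, a great TTFT with slow decode still loses badly
# to a mediocre TTFT with fast decode:
fast_first_token = total_latency_s(0.05, 500, 25.0)   # 50 ms TTFT, 25 tok/s
fast_decode      = total_latency_s(0.20, 500, 50.0)   # 200 ms TTFT, 50 tok/s
```

The second configuration finishes in roughly half the time despite a 4x worse TTFT, which is why headline TTFT numbers alone don't tell you much about long-completion workloads.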

reply
As far as benchmarketing goes, they clearly went with prefill because it's much easier for Apple to improve prefill numbers (FLOPS-dominated) than decode (bandwidth-dominated, at least for local inference); the M5's unified memory bandwidth is only about 10% better than the M4's.
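A back-of-envelope version of that argument, with entirely made-up hardware numbers (not Apple's specs): prefill time scales with compute (~2 FLOPs per parameter per prompt token for a dense model), while decode speed is capped by how fast the weights can be streamed from memory (roughly one full weight read per generated token, ignoring KV cache and batching effects):

```python
# Rough roofline-style estimates. All hardware numbers are illustrative
# assumptions, not published M5 figures.

def prefill_s(prompt_tokens: int, params: float, flops_per_s: float) -> float:
    """Compute-bound prefill time: ~2 FLOPs per param per prompt token."""
    return prompt_tokens * 2 * params / flops_per_s

def decode_tok_per_s(weight_bytes: float, bandwidth_bytes_s: float) -> float:
    """Bandwidth-bound decode rate: one full weight read per token."""
    return bandwidth_bytes_s / weight_bytes

# 14B dense model, 4-bit weights (~7 GB), with hypothetical
# 30 TFLOPS of compute and 150 GB/s of memory bandwidth:
ttft_est = prefill_s(8000, 14e9, 30e12)   # prefill an 8K prompt
tps_est = decode_tok_per_s(7e9, 150e9)    # sustained decode tokens/s
```

Under these assumptions, more FLOPS cut TTFT proportionally, but decode tokens/s barely moves unless bandwidth does, which matches the ~10% bandwidth bump.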
reply
In previous generations, throughput was excellent for an integrated GPU, but the time to first token was lacking.
reply
So throughput was already good but TTFT was the metric that needed more improvement?
reply
To add to the sibling's "good is relative": it also depends on what you're running, not just your tolerance for what counts as good. E.g., with a MoE, the decode speedup makes the prompt-processing delay more noticeable for the same model size in RAM.
reply
Good is relative but first token was clearly the biggest limitation.
reply
Not strange, for the kind of applications models at that size are often used for the prefill is the main factor in responsiveness. Large prompt, small completion.
reply
I assume it’s time to first output token, so it’s basically throughput: how fast it can churn through 8,001 tokens.
reply
No, you don't. Not as a sticky, mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, which doesn't actually do a damned thing but gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric, although it's not the only one.
reply
Some kinds of spinners serve as a coal-mine canary indicating if the app has gotten wedged. Not hugely useful, but also not entirely useless.
reply
I would consider it reasonable if this were 4x for both TTFT and throughput, but it seems like it's only for TTFT.
reply
Like saying my PC boots up 2x faster so it must be 2x more powerful. lol
reply