upvote
>Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization

Oh dear, 14B and 4-bit quant? There are going to be a lot of embarrassed programmers who need to explain to their engineering managers why their MacBook can't reasonably run LLMs like they said it could. (This already happened at my Fortune 20 company lol)

reply
I wonder if Apple has foresight into locally running LLMs becoming sufficiently useful.
reply
It won’t handle serious tasks, but I have Gemma 3 installed on my M2 Mac and it is good for most of my needs, especially data I don’t want a corporation getting its hands on.
reply
They do! "You're holding it wrong."
reply
That is talking about battery life, not AI tasks. See footnote 53, where it says, "Up to 18 hours battery life":

https://www.apple.com/macbook-pro/

reply
Quite interesting that it's now a selling point just like fps in Crysis was a long time ago.
reply
Next is the fps of an AI playing Crysis.
reply
Or tasks per minute of the AI doing your job for you
reply
That measurement will be AI assembling MacBook Pros vs. human assemblers: number of units per hour, day, or whatever unit is most applicable.
reply
[flagged]
reply
So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.
reply
For many workflows involving real-time human interaction, such as a voice assistant, this is the most important metric. Once a certain response-quality threshold has been reached, very few tasks are as sensitive to quality as the software planning and writing tasks most HN readers are familiar with.
reply
The way that voice assistants work even in the age of LLMs is

Voice -> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> Text to Speech.

TTFT is irrelevant; you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model.

Source: I do this kind of stuff for call centers. Yes, I know modern LLMs don’t have to go through the voice -> text -> LLM -> text -> voice pipeline anymore, but that only works when you don’t have to call external sources.
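The pipeline above can be sketched roughly like this. Every stage is a stub with made-up function names and data (nothing here is a real ASR/LLM/TTS API); the point is just the sequencing, and why the whole chain has to finish before any audio comes back:

```python
import json

# Hypothetical stubs for each stage of the classic voice-assistant pipeline.
# A real system would call an ASR engine, an LLM endpoint, a backend API,
# and a TTS engine; here each step returns canned data.

def speech_to_text(audio: bytes) -> str:
    return "what's my account balance"  # stub ASR result

def llm_extract_intent(utterance: str) -> str:
    # A real system would prompt an LLM to emit structured JSON.
    return json.dumps({"intent": "get_balance", "account": "checking"})

def call_backend_api(intent: dict) -> dict:
    return {"balance": 1234.56}  # stub external API call

def llm_compose_reply(api_response: dict) -> str:
    return f"Your balance is ${api_response['balance']:.2f}."

def text_to_speech(text: str) -> bytes:
    return text.encode()  # stub TTS

def handle_turn(audio: bytes) -> bytes:
    utterance = speech_to_text(audio)
    intent = json.loads(llm_extract_intent(utterance))
    api_response = call_backend_api(intent)
    reply = llm_compose_reply(api_response)
    return text_to_speech(reply)
```

Note that the user hears nothing until `handle_turn` returns, which is why end-to-end pipeline latency, not TTFT of any single model, is what matters here.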

reply
Topical. My hobby project this week (0) has been hyper-optimizing microgpt for M5's CPU cores (and comparing to MLX performance). Wonder if anything changes under the regime I've been chasing with these new chips.

0: https://entrpi.github.io/eemicrogpt/

reply
Consider using FP16 or BF16 for the matrix math (with SME you can use svmopa_za16_f16_m or svmopa_za16_bf16_m).
reply
14-billion parameter model with 4-bit quantization seems rather small
reply
I think these aren't meant to be representative of arbitrary userland-workload LLM inferences, but rather the kinds of tasks macOS might spin up a background LLM inference for. Like the Apple Intelligence stuff, or Photos auto-tagging, etc. You wouldn't want the OS to ever be spinning up a model that uses 98% of RAM, so Apple probably considers themselves to have at most 50% of RAM as working headroom for any such workloads.
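As a rough sanity check of that sizing (my own back-of-envelope assumptions, not anything Apple has published; this ignores KV cache and activation overhead):

```python
# Back-of-envelope memory math for a 14B-parameter model at 4-bit
# quantization, checked against a hypothetical "50% of RAM" OS headroom.
# Assumption: weight storage dominates; KV cache/activations are ignored.

def model_weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

weights_gb = model_weight_gb(14, 4)   # 14B params at 4 bits ~= 7 GB
headroom_gb = 0.5 * 24                # half of a hypothetical 24 GB machine
fits = weights_gb < headroom_gb       # comfortably within that budget
```

So a 14B/4-bit model is about the largest thing that leaves comfortable slack on mid-range configs, which is consistent with it being a background-workload ceiling rather than a brag.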
reply
It's not much for a frontier AI but it can be a very useful specialized LLM.
reply
On my 24GB RAM M4 Pro MBP, some models run very quickly through LM Studio to Zed; I was able to ask it to write some code. Of course my fan starts spinning like the world's ending, but it's still impressive what I can do 100% locally. I can't imagine what it's like on a more serious setup like the Mac Studio.
reply
How is the output quality of the smaller models?
reply
Not good enough for coding anything more than simple scripts.

Generally, the fewer parameters, the less knowledge they have.

reply
For anyone who has been watching Apple since the iPod commercials: Apple has long operated in a grey area when it comes to the honesty of its marketing.

And not even diehard Apple fanboys deny this.

I genuinely feel bad for people who fall for their marketing thinking they will run LLMs. Oh well, I got scammed on RuneScape as a child when someone said they could trim my armor... Everyone needs to learn.

reply
In retrospect, was there a better place to learn about the cruelty of the world than RuneScape? I must've gotten scammed thrice before I lost the youthful light in my eyes.
reply
Yesterday I ran qwen3.5:27b on an M1 Max with 64 GB of RAM. I have even run Llama 70B back when llama.cpp came out. These run sufficiently well, if somewhat slowly, but given the improvements in the M5 Max it should be a much faster experience.
reply
I don't know that there would be a huge overlap between the people who would fall for this type of marketing and the people who want to run LLMs locally.

There definitely are some who fit into this category, but if they're buying the latest and greatest on a whim then they've likely got money to burn and you probably don't need to feel bad for them.

Reminds me of the saying: "A fool and his money are soon parted".

reply
There used to be a polite way to call this out: the "Steve Jobs reality distortion field".
reply
Now that every CEO has their own reality distortion field I wonder if it's even worth calling out any more.
reply
Most are not nearly as smooth and successful at the distorting.
reply
It is.

That's how they make loot on their 128GB MacBook Pros. By kneecapping the cheap stuff. Don't think for a second that the specs weren't chosen so that professional developers would have to shell out the 8 grand for the legit machine. They're only gonna let us do the bare minimum on a MacBook Air.

reply
Does that include loading the model again? Apple seems to be the only company doing such shenanigans in their measurements.
reply
Seems very reasonable to me
reply
A bit strange to use time to first token instead of throughput.

Latency to the first token is not like a web page where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput)
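A rough way to see it: perceived total latency is TTFT plus output length divided by throughput, so for long completions throughput dominates. A toy calculation with made-up numbers (nothing here is a measured figure):

```python
# Total perceived latency = time to first token + decode time.
# All numbers below are illustrative, not benchmarks.

def total_latency_s(ttft_s: float, out_tokens: int, tok_per_s: float) -> float:
    return ttft_s + out_tokens / tok_per_s

# For a 500-token answer, a great TTFT with slow decode still loses badly
# to a mediocre TTFT with fast decode:
fast_first_token = total_latency_s(0.05, 500, 25.0)   # 50 ms TTFT, 25 tok/s
fast_decode      = total_latency_s(0.20, 500, 50.0)   # 200 ms TTFT, 50 tok/s
```

The second configuration finishes in roughly half the time despite a 4x worse TTFT, which is why headline TTFT numbers alone don't tell you much about long-completion workloads.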

reply
As far as benchmarketing goes, they clearly went with prefill because it's much easier for Apple to improve prefill numbers (FLOPS-dominated) than decode (bandwidth-dominated, at least for local inference); the M5's unified memory bandwidth is only about 10% better than the M4's.
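A back-of-envelope version of that argument, with entirely made-up hardware numbers (not Apple's specs): prefill time scales with compute (~2 FLOPs per parameter per prompt token for a dense model), while decode speed is capped by how fast the weights can be streamed from memory (roughly one full weight read per generated token, ignoring KV cache and batching effects):

```python
# Rough roofline-style estimates. All hardware numbers are illustrative
# assumptions, not published M5 figures.

def prefill_s(prompt_tokens: int, params: float, flops_per_s: float) -> float:
    """Compute-bound prefill time: ~2 FLOPs per param per prompt token."""
    return prompt_tokens * 2 * params / flops_per_s

def decode_tok_per_s(weight_bytes: float, bandwidth_bytes_s: float) -> float:
    """Bandwidth-bound decode rate: one full weight read per token."""
    return bandwidth_bytes_s / weight_bytes

# 14B dense model, 4-bit weights (~7 GB), with hypothetical
# 30 TFLOPS of compute and 150 GB/s of memory bandwidth:
ttft_est = prefill_s(8000, 14e9, 30e12)   # prefill an 8K prompt
tps_est = decode_tok_per_s(7e9, 150e9)    # sustained decode tokens/s
```

Under these assumptions, more FLOPS cut TTFT proportionally, but decode tokens/s barely moves unless bandwidth does, which matches the ~10% bandwidth bump.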
reply
In previous generations, throughput was excellent for an integrated GPU, but the time to first token was lacking.
reply
So throughput was already good but TTFT was the metric that needed more improvement?
reply
To add to the sibling's "good is relative": it also depends on what you're running, not just your tolerance for what counts as good. E.g., with a MoE, the decode speedup makes the prompt-processing delay more noticeable for the same model size in RAM.
reply
Good is relative but first token was clearly the biggest limitation.
reply
Not strange, for the kind of applications models at that size are often used for the prefill is the main factor in responsiveness. Large prompt, small completion.
reply
I assume it’s time to first output token, so it’s basically throughput: how fast it can churn through 8,001 tokens.
reply
No, you don't. Not as a sticky, mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, which doesn't actually do a damned thing but gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric, although it's not the only one.
reply
Some kinds of spinners serve as a coal-mine canary indicating if the app has gotten wedged. Not hugely useful, but also not entirely useless.
reply
I would consider it reasonable if this were 4x for both TTFT and throughput, but it seems like it's only for TTFT.
reply
Like saying my PC boots up 2x faster so it must be 2x more powerful. lol
reply