> I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.

Yeah, I didn't write this as a proper developer guide. My screen recording started getting loads of favourites and I started getting messages asking how I set it up, so I just threw up a quick rundown of how I set up this test.

I literally just saw the Unclothe announcement about "Double the speed", thought "Ha. I wonder if that will get it fast enough that I'd actually be prepared to use it", and had a go at setting it up.

I'd done tests last year with things like Devstral, but they were always so slow and so dumb that I didn't want to bother.

This finally hit the "wow, this is useable" level of both speed and intelligence.

reply
I wasn't familiar with Unclothe, so I had to look it up.

Are you sure you did not mean Unsloth?

reply
They likely did, and this autocorrect slip might suggest why OP is using local models :)
reply
Indeed, a clear Freudian slip. The one where you say one thing, but you mean your mother.
reply
Realistically, you need to benchmark with a user prompt plus a good amount of system prompt (at least 1000 tokens, and more realistically somewhere around 3000 tokens).

llama.cpp includes tools for that. What you want is a prefill (prompt processing) phase before token generation, so that generation is measured properly. Increasingly, measuring token generation speed at longer context (32k or 64k) is important too.
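A rough sketch of what such a run could look like, assuming `llama-bench` (the benchmarking tool that ships with llama.cpp) and assuming its `-m` (model), `-p` (prompt tokens), `-n` (generated tokens) and `-d` (context depth) flags as found in recent builds. Check `llama-bench --help` on your build before relying on them. `model.gguf` and the helper name `llama_bench_cmd` are placeholders, and the script only builds and prints the command, it does not run it.

```python
import shlex


def llama_bench_cmd(model_path, prompt_tokens=3000, gen_tokens=128,
                    depths=(0, 32768, 65536)):
    """Build an argument list for llama-bench (hypothetical helper).

    Flag names are an assumption about recent llama.cpp builds:
      -p  number of prompt tokens to prefill (the "system prompt" part)
      -n  number of tokens to generate
      -d  context depth(s) already in the KV cache before measuring
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-p", str(prompt_tokens),
        "-n", str(gen_tokens),
        "-d", ",".join(str(d) for d in depths),
    ]


# Placeholder model path; substitute your own GGUF file.
cmd = llama_bench_cmd("model.gguf")
print(shlex.join(cmd))
```

Once the flags are confirmed for your build, the list can be passed to `subprocess.run(cmd)`.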

reply
At 128 tokens, you’re benchmarking the overture, not the opera.
reply
I thought the same thing when I started using local models, but the reality is that, for a given context depth, the token generation speed doesn't change whether you generate 128 tokens or 8000. A larger count just lengthens the benchmark run time.
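To put numbers on that, here is a tiny sketch that assumes a constant generation rate at a fixed context depth. The 40 tokens/s figure is made up for illustration and is not a measurement.

```python
def bench_seconds(gen_tokens, tokens_per_second):
    """Wall-clock time to generate gen_tokens at a constant rate."""
    return gen_tokens / tokens_per_second


RATE = 40.0  # hypothetical tokens/s at some fixed context depth

for n in (128, 8000):
    # The reported rate is identical in both cases; only the run time differs.
    print(f"{n} tokens at {RATE} tok/s takes {bench_seconds(n, RATE):.1f} s")
```

Under that assumption, the longer run reports the same tokens/s and simply takes proportionally longer to finish.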
reply
This is akin to saying “it runs on my machine” without actually examining the problem. Sad. You’re absolutely right that 128 tokens is nothing; it’s little more than a hello response.
reply