> I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.

Yeah, I didn't write this as a proper developer guide. My screen recording started getting loads of favourites and I started getting messages asking how I set it up, so I just threw up a quick rundown of how I set up this test.

I literally just saw the Unclothe announcement about "Double the speed" and thought "Ha. I wonder if that will get it fast enough that I'd actually be prepared to use it" and had a go at setting it up.

I'd done tests last year with things like Devstral, but they were always so slow and dumb that I didn't want to bother.

This finally hit the "wow, this is useable" level of both speed and intelligence.

by Phemist 5 hours ago

I wasn't familiar with Unclothe, so I had to look it up.

Are you sure you did not mean Unsloth?

by threecheese 3 hours ago

They likely did, and this autocorrect slip might suggest why OP is using local models :)

by Phemist 3 hours ago

Indeed, a clear Freudian slip. The one where you say one thing, but you mean your mother.

by liuliu 23 hours ago

Realistically, you need to benchmark with a user prompt plus a sizeable system prompt (at least 1,000 tokens, and something in the range of 3,000 tokens is probably good).

llama.cpp includes tools for that. What you want is a prefill before token generation, so that you measure it properly. Increasingly, measuring token generation speed at longer context (32k or 64k) is important too.
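
To make the prefill point concrete, here is a rough back-of-the-envelope sketch in Python. The function name and both speed figures are made up for illustration; they are assumptions, not measurements from any model or machine. It also assumes generation speed stays constant, which tends not to hold at longer context, so treat the long-prompt row as optimistic.

```python
def end_to_end_seconds(prompt_tokens, gen_tokens, prefill_tps, gen_tps):
    """Split wall time into prefill (prompt processing) and generation.

    Hypothetical helper: assumes both speeds are constant, which ignores
    the slowdown in generation that usually shows up at longer context.
    """
    prefill_s = prompt_tokens / prefill_tps
    gen_s = gen_tokens / gen_tps
    return prefill_s, gen_s, prefill_s + gen_s


# Assumed speeds in tokens per second, chosen only for illustration.
PREFILL_TPS = 500.0
GEN_TPS = 25.0
GEN_TOKENS = 128

for prompt_tokens in (0, 3000, 32000):
    prefill_s, gen_s, total_s = end_to_end_seconds(
        prompt_tokens, GEN_TOKENS, PREFILL_TPS, GEN_TPS
    )
    # Effective rate: generated tokens divided by the full wait, prefill included.
    effective_tps = GEN_TOKENS / total_s
    print(
        f"prompt={prompt_tokens:>6}  prefill={prefill_s:6.2f}s  "
        f"gen={gen_s:5.2f}s  effective={effective_tps:5.2f} tok/s"
    )
```

With an empty prompt, the effective rate equals the raw generation rate, which is the number a generation-only benchmark reports. Once a few thousand prompt tokens are in play, the wait before the first token becomes a large share of the total, and that is the part such a benchmark hides.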

by willXare 20 hours ago

At 128 tokens, you’re benchmarking the overture, not the opera.

by lloyd-christmas 20 hours ago

I thought the same thing when I started using locals, but the reality is that - for a given context depth - the token generation speed doesn't change whether it's 128 or 8000, it just lengthens the benchmark run time.

by reactordev 22 hours ago

This is akin to saying “it runs on my machine” without actually examining the problem. Sad. You’re absolutely right that 128 tokens is nothing; it’s little more than a hello response.