llamafile has AVX512-specific optimizations for prompt processing that address exactly this issue: https://justine.lol/matmul/ (about a 10x speedup over llama.cpp)

Somewhere between 8 and 192 cores I'm sure there's enough AVX512 to get the job done. And we've managed to reinvent Intel's Larrabee / Knights concept.

Sadly, the highly optimized AVX512 kernels of llamafile don't support these exotic floats yet as far as I know.

Yes, energy efficiency per query will be terrible compared to a hyperscaler. However, privacy will be perfect. Flexibility will also be higher than with other options, since running on the CPU is almost always possible, even with new algorithms and experimental models.

by ein0p 287 days ago

At 192 cores you're way better off buying a Mac Studio, though.

by bigyabai 283 days ago

[flagged]