
This is interesting work, thank you for sharing. What hardware would you buy today for experimenting? Seems like the new gen of MacBook Pros is pretty powerful?

by tatef, 3 hours ago

Yes, definitely. I use an M1 Max with 32 GB of RAM daily, and its performance is about on par with the new base M5 Pro with 24 GB. You can check the benchmarks in the repo if you're interested in specific performance metrics, but investing in Apple hardware with as much memory as possible will generally get you furthest in this game.

by WithinReason, 5 hours ago

Have you ever generated access frequency statistics for the experts in these models, something like a histogram?

by Gracana, 4 hours ago

ktransformers can do dynamic placement of experts and could presumably produce such a histogram, though currently its activation statistics are just a ".pt" file. https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...

FWIW I never got it to work and did not dig into it much.
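For anyone who wants to try the histogram idea by hand, here is a minimal sketch. It assumes you have already logged the router's top-k expert indices per token somehow; the `routed` trace, the function names, and the 4-expert setup are all made up for illustration, and this is not the format ktransformers stores in its ".pt" file.

```python
from collections import Counter


def expert_histogram(routed, num_experts):
    """Count how often each expert index appears in a routing trace.

    `routed` is an iterable of per-token tuples of expert indices
    (hypothetical format, e.g. the top-k choices of an MoE router).
    """
    counts = Counter()
    for token_experts in routed:
        counts.update(token_experts)
    # Include experts that were never selected, with a count of zero.
    return [counts.get(e, 0) for e in range(num_experts)]


def render(hist, width=40):
    """Render the counts as a text bar chart, scaled to the busiest expert."""
    peak = max(hist, default=0) or 1
    lines = []
    for e, n in enumerate(hist):
        bar = "#" * round(width * n / peak)
        lines.append(f"expert {e:3d} | {bar} {n}")
    return "\n".join(lines)


# Made-up trace: 5 tokens, top-2 routing over 4 experts.
routed = [(0, 2), (0, 1), (2, 0), (3, 0), (2, 1)]
hist = expert_histogram(routed, num_experts=4)
print(render(hist))
```

In a real model you would keep one such histogram per MoE layer, since expert popularity can differ a lot between layers.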

by lostmsu, 5 hours ago

Why would llama with --mmap crash?

by zozbot234, 5 hours ago

This doesn't surprise me all that much: mmap support gets little attention in general and interacts poorly with GPU-side inference. (And that's with it being the default; you don't even really need to specify it as a CLI option.) OP has raised a discussion with the llama.cpp folks at https://github.com/ggml-org/llama.cpp/discussions/20852 but there has been little interest so far.