upvote
> As you have so much RAM I would suggest running Q8_0 directly

On the 48GB Mac - absolutely. The 24GB one cannot run Q8, hence the comparison.

> And just to be sure: you're running the MLX version, right?

Nah, not yet. I have only tested in LM Studio, and they don't have recommended MLX versions yet.

> but has since been fixed on the main branch

That's good to know, I will play around with it.

reply
> That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch

Unfortunately I have had zero success running Gemma with the mlx-lm main branch. Can you point me to the right way? I have zero experience with mlx-lm.

reply
Get into a venv, and run:

> pip3 install git+https://github.com/ml-explore/mlx-lm.git

> ./venv/bin/mlx_lm.generate --model "$MODEL" --temp 1.0 --top-p 0.95 --top-k 64 --max-tokens 128000 --prompt "Hello world"

Where $MODEL is an unsloth model like:

- unsloth/gemma-4-E4B-it-UD-MLX-4bit

- unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit

reply
Thanks!
reply
Gemma 4 is not supported by the MLX engine yet.
reply
It is, as I'm running it; support was added this week. As I said, I'm running the main branch from GitHub and doing nothing special, see: https://news.ycombinator.com/item?id=47761308
reply