by ryandrake22 hours ago |

by danielhanchen22 hours ago|

We made Unsloth Studio which should help :)

1. Auto best official parameters set for all models

2. Auto determines the largest quant that can fit on your PC / Mac etc

3. Auto determines max context length

4. Auto heals tool calls, provides python & bash + web search :)

by ryandrake21 hours ago|

Yea, I actually tried it out last time we had one of these threads. It's undeniably easy to use, but it is also very opinionated about things like the directory locations/layouts for various assets. I don't think I managed to get it to work with a simple flat directory full of pre-downloaded models on an NFS mount to my NAS. It also insists on re-downloading a 3GB model every time it launches, even after I delete the model file. I probably just have to sit down and do some Googling/searching in order to rein the software in and get it to work the way I want it to on my system.

by hypercube3320 hours ago|

Sadly it doesn't support fine-tuning on AMD yet, which gave me a sad since I wanted to cut one of these down into specific domain experts. Also, running the studio is a bit of a nightmare when it calls diskpart during its install (why?)

by WanderPanda20 hours ago|

I applaud that you recently started providing the KL divergence plots; they really help in understanding how different quantizations compare. But how well does this correlate with closed-loop performance? How difficult/expensive would it be to run the quantizations on e.g. some agentic coding benchmarks?
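For readers unfamiliar with the metric, here is a minimal sketch of what a per-token KL divergence comparison measures: how far the quantized model's next-token distribution drifts from the reference model's. The logits are made-up toy numbers, and the helper names (`softmax`, `kl_divergence`, `mean_kld`) are illustrative, not taken from Unsloth's or llama.cpp's tooling.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; P is the reference distribution, Q the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(ref_logits, quant_logits):
    """Average the per-position KL divergence over a sequence of token positions."""
    klds = [kl_divergence(softmax(r), softmax(q))
            for r, q in zip(ref_logits, quant_logits)]
    return sum(klds) / len(klds)

# Toy logits over a 4-token vocabulary at two positions (made-up numbers).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quant_mild = [[1.9, 1.1, 0.1, -1.0], [0.5, 0.4, 3.0, 0.1]]
quant_strong = [[1.0, 2.0, 0.1, -1.0], [3.0, 0.5, 0.5, 0.0]]

print(f"mild perturbation:   {mean_kld(ref, quant_mild):.4f} nats")
print(f"strong perturbation: {mean_kld(ref, quant_strong):.4f} nats")
```

A low mean KLD says the quantized model predicts nearly the same distribution token by token; it does not by itself say how errors compound over a long agentic session, which is the open question raised above.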

by Zopieux18 hours ago|

Thanks for that. Did you notice that the unsloth/unsloth docker image is 12GB? Does it embed CUDA libraries or some default models that justify the heavy footprint?

by jbellis20 hours ago|

what are you using for web search?

by cyanydeez20 hours ago|

Is unsloth working on managing remote servers, like how vscode integrates with a remote server via ssh?

by kristjansson20 hours ago|

Lmstudio Link is GREAT for that right now

by wuschel22 hours ago|

Great project! Thank you for that!

by Aurornis22 hours ago|

> Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL

There are actually two problems with this:

First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions.

Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of VRAM.

The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context.
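The gap between "it runs" and "it runs with useful context" can be made concrete with a rough budget. The sketch below assumes the f16 KV cache figure of 16 tokens per MiB reported elsewhere in this thread for the 27B model; the quant file sizes and the 1 GiB overhead are hypothetical placeholders, not real GGUF sizes.

```python
def max_context_tokens(vram_mib, weights_mib, tokens_per_mib, overhead_mib=1024):
    """Tokens of KV cache that fit once weights and a fixed overhead are loaded.

    Returns 0 when the weights plus overhead already exceed the available VRAM.
    """
    free_mib = vram_mib - weights_mib - overhead_mib
    return max(0, int(free_mib * tokens_per_mib))

VRAM_MIB = 20 * 1024   # the 20GB card from the quoted example
TOKENS_PER_MIB = 16    # f16 KV cache figure reported in this thread for the 27B model

# Hypothetical file sizes in MiB, for illustration only; check the real GGUF sizes.
quants = {
    "Q3 (hypothetical)": 13_000,
    "Q4 (hypothetical)": 16_500,
    "Q5 (hypothetical)": 19_000,
}

for name, size_mib in quants.items():
    ctx = max_context_tokens(VRAM_MIB, size_mib, TOKENS_PER_MIB)
    print(f"{name}: room for about {ctx} tokens of f16 KV cache")
```

The point of the exercise is the trade-off: each step up in quant size eats directly into the context you have left, so the largest quant that loads is not necessarily the one you want.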

by zargon20 hours ago|

The Qwen3.5 series is a little bit of an exception to the general rule here. Its KV cache is incredibly size-efficient. I think the max context (262k) fits in 3GB at q8 iirc. I prefer to keep the cache at full precision though.

by zargon17 hours ago|

I just tested it and have to make a correction. With llama.cpp, 262144 tokens context (Q8 cache) used 8.7 GB memory with Qwen3.6 27B. Still very impressive.
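As a sanity check on that measurement, one can scale the f16 cache size that a llama.cpp log line elsewhere in this thread reports for the 27B model (3072 MiB for 49152 cells). This is only a sketch: it assumes the cache grows linearly with context, that both K and V use ggml's Q8_0 layout (34 bytes per 32 values), and that the two comments refer to the same model. `scale_cache_mib` is an illustrative helper, not a llama.cpp function.

```python
F16_BYTES_PER_VALUE = 2.0
# ggml's Q8_0 packs 32 int8 values plus one 2-byte scale into a 34-byte block.
Q8_0_BYTES_PER_VALUE = 34 / 32

def scale_cache_mib(f16_mib, f16_cells, target_cells, bytes_per_value):
    """Scale a measured f16 KV cache size to another context length and cache type."""
    mib_per_cell = f16_mib / f16_cells
    return mib_per_cell * target_cells * (bytes_per_value / F16_BYTES_PER_VALUE)

# Measured f16 figure for the 27B model, from a llama.cpp log quoted in this thread.
f16_mib, f16_cells = 3072.0, 49152

full_f16 = scale_cache_mib(f16_mib, f16_cells, 262144, F16_BYTES_PER_VALUE)
full_q8 = scale_cache_mib(f16_mib, f16_cells, 262144, Q8_0_BYTES_PER_VALUE)
print(f"f16  @ 262144 tokens: {full_f16:.0f} MiB")
print(f"q8_0 @ 262144 tokens: {full_q8:.0f} MiB")
```

Under these assumptions the Q8_0 estimate comes out at 8704 MiB (8.5 GiB), which is in the same range as the figure measured above.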

by magicalhippo13 hours ago|

The MoE variants are more cache efficient. Here is the log line from Qwen3.6 35B A3B MoE with 256k (262144) context at full F16 (so no cache quality loss):

  llama_kv_cache: size = 5120.00 MiB (262144 cells,  10 layers,  4/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB

The MXFP4-quantized variant from Unsloth just fits my 5090 with 32GB VRAM at 256k context.

Meanwhile here's for Qwen 3.6 27B:

  llama_kv_cache: size = 3072.00 MiB ( 49152 cells,  16 layers,  4/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB

So 16 tokens per MiB for the 27B model vs about 51 tokens per MiB for the 35B MoE model.
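The arithmetic behind those two numbers can be reproduced straight from the log lines. This is a small sketch that parses the two `llama_kv_cache` lines quoted above; the regular expression and the `kv_stats` helper are written for this exact log shape and are not part of llama.cpp.

```python
import re

LOG_LINES = [
    "llama_kv_cache: size = 5120.00 MiB (262144 cells,  10 layers,  4/1 seqs), "
    "K (f16): 2560.00 MiB, V (f16): 2560.00 MiB",
    "llama_kv_cache: size = 3072.00 MiB ( 49152 cells,  16 layers,  4/1 seqs), "
    "K (f16): 1536.00 MiB, V (f16): 1536.00 MiB",
]

PATTERN = re.compile(
    r"size\s*=\s*([\d.]+)\s*MiB\s*\(\s*(\d+)\s*cells,\s*(\d+)\s*layers"
)

def kv_stats(line):
    """Extract cache size, cell count and layer count from one llama_kv_cache line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    size_mib, cells, layers = float(m.group(1)), int(m.group(2)), int(m.group(3))
    return {
        "tokens_per_mib": cells / size_mib,
        # K and V together, per token, per layer as reported in the log
        "bytes_per_token_per_layer": size_mib * 1024 * 1024 / (cells * layers),
    }

for label, line in zip(["35B A3B MoE", "27B"], LOG_LINES):
    stats = kv_stats(line)
    print(f"{label}: {stats['tokens_per_mib']:.1f} tokens/MiB, "
          f"{stats['bytes_per_token_per_layer']:.0f} bytes per token per layer")
```

Going by these two log lines, the MoE model's advantage comes from both fewer cached layers (10 vs 16) and fewer bytes per token per layer.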

I went for the Q5 UD variant for 27B, so I could just fit 48k context, though it seems that if I went for the Q4 UD variant I could get 64k context.

That said, I haven't tried the Qwen3.6 35B MoE to figure out if it can effectively use the full 256k context; that varies from model to model depending on how the model was trained.

by smallerize20 hours ago|

I found the KLD benchmark image at the bottom of https://unsloth.ai/docs/models/qwen3.6 to be very helpful when choosing a quant.

by ryandrake21 hours ago|

Yea, I'm also kind of jealous of Apple folks with their unified RAM. On a traditional homelab setup with gobs of system RAM and a GPU with relatively little VRAM, all that system RAM sits there useless for running LLMs.

by zozbot23421 hours ago|

That "traditional" setup is the recommended setup for running large MoE models, leaving shared routing layers on the GPU to the extent feasible. You can even go larger-than-system-RAM via mmap, though at a non-trivial cost in throughput.

by khimaros21 hours ago|

Strix Halo is another option

by jmspring17 hours ago|

qwen3.5 27b w/ 4bit quant works reasonably on a 3090.

by dannyw14 hours ago|

Evaluating different quant levels for your use case is a problem you can pretty reliably throw at a coding agent and leave overnight though. At least, it should give you a much smaller shortlist.

by regularfry16 hours ago|

To add more complexity to the picture, you can run MoE models at a higher quant than you might think, because CPU expert offload is less impactful than full layer offload for dense models.
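A back-of-the-envelope model shows why that is plausible. The sketch assumes decoding is purely memory-bandwidth bound and that each token reads every weight it uses exactly once; it ignores compute, PCIe transfers, and the KV cache. The bandwidth figures, the bytes-per-parameter value, and the split of active parameters between GPU and CPU are all hypothetical numbers chosen for illustration.

```python
def decode_tokens_per_sec(gpu_params_b, cpu_params_b, bytes_per_param,
                          gpu_gbps, cpu_gbps):
    """Upper bound on decode speed when each token reads the given weights once.

    gpu_params_b / cpu_params_b: billions of parameters read per token per device.
    Billions of params times bytes per param gives GB, so GB / (GB/s) is seconds.
    """
    gpu_seconds = gpu_params_b * bytes_per_param / gpu_gbps
    cpu_seconds = cpu_params_b * bytes_per_param / cpu_gbps
    return 1.0 / (gpu_seconds + cpu_seconds)

# All values below are assumptions, not measurements.
BYTES_PER_PARAM = 0.55   # roughly a 4-bit quant including scales
GPU_GBPS = 900.0         # hypothetical GPU memory bandwidth
CPU_GBPS = 80.0          # hypothetical system RAM bandwidth

# Dense 27B with a third of the layers offloaded: every token reads all 27B weights.
dense = decode_tokens_per_sec(18.0, 9.0, BYTES_PER_PARAM, GPU_GBPS, CPU_GBPS)

# MoE with about 3B active parameters per token: assume 1B shared weights on the
# GPU and 2B of routed expert weights read from system RAM.
moe = decode_tokens_per_sec(1.0, 2.0, BYTES_PER_PARAM, GPU_GBPS, CPU_GBPS)

print(f"dense, partial layer offload: {dense:.1f} tok/s upper bound")
print(f"MoE, experts on CPU:          {moe:.1f} tok/s upper bound")
```

The CPU side dominates in both cases, but the MoE only pulls the active experts through the slow path each token, which is why it tolerates offload better in this simplified model.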

by mudkipdev16 hours ago|

HuggingFace has a nice UI where you can save your specs to your account and it will display a checkmark/red X next to every unsloth quantization to estimate if it will fit.