https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
For the best quality reply, I used the Gemma-4 31B UD-Q8_K_XL quant with Unsloth Studio to summarize the URL with web search. It produced 4.9 tok/s (including web search) on an MacBook Pro M1 Max with 64GB.
Here an excerpt of it's own words:
Unsloth Dynamic 2.0 Quantization
Dynamic 2.0 is not just a "bit-reduction" but an intelligent, per-layer optimization strategy.
- Selective Layer Quantization: Instead of making every layer 4-bit, Dynamic 2.0 analyzes every single layer and selectively adjusts the quantization type. Some critical layers may be kept at higher precision, while less critical layers are compressed more.
- Model-Specific Tailoring: The quantization scheme is custom-built for each model. For example, the layers selected for quantization in Gemma 3 are completely different from those in Llama 4.
- High-Quality Calibration: They use a hand-curated calibration dataset of >1.5M tokens specifically designed to enhance conversational chat performance, rather than just optimizing for Wikipedia-style text.
- Architecture Agnostic: While previous versions were mostly effective for MoE (Mixture of Experts) models, Dynamic 2.0 works for all architectures (both MoE and non-MoE).