Any other resources like that you could share?
Also, what kind of models do you run with mlx and what do you use them for?
Lately I’ve been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I’ve been surprised by qwen2.5-coder:32b. And I’m really impressed by the speed and versatility, at such a tiny size, of qwen2.5:0.5b (I'm playing with fine-tuning it to see if I can get it to generate some decent conversations roleplaying as a character).
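If it helps anyone trying the same thing, this is roughly how I'm setting up the fine-tune with mlx-lm's LoRA trainer. A sketch only: the persona, paths, and flag values are my placeholders, so check mlx_lm.lora --help for the real options:

    import json

    # Hypothetical persona dialogues. mlx-lm's LoRA trainer reads a
    # directory containing train.jsonl / valid.jsonl; this "chat"
    # schema is one of the formats it accepts.
    examples = [
        {"messages": [
            {"role": "system", "content": "You are Captain Vex, a weary space smuggler."},
            {"role": "user", "content": "Where were you last night?"},
            {"role": "assistant", "content": "Somewhere the port authority wasn't."},
        ]},
    ]

    with open("data/train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

    # Then train from the shell (flags from memory, verify with --help):
    #   mlx_lm.lora --model Qwen/Qwen2.5-0.5B-Instruct --train --data data --iters 600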
I mainly use MLX for LLMs (with https://github.com/ml-explore/mlx-lm and my own https://github.com/simonw/llm-mlx which wraps that), vision LLMs (via https://github.com/Blaizzy/mlx-vlm) and running Whisper (https://github.com/ml-explore/mlx-examples/tree/main/whisper)
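The mlx-lm Python API is a simple load-and-generate pair if you want to script it directly. A minimal sketch; the model name here is just one of the mlx-community 4-bit quantizations, anything from that org should work:

    from mlx_lm import load, generate

    # First run downloads the weights from Hugging Face.
    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

    messages = [{"role": "user", "content": "Write a haiku about Apple Silicon."}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
    print(generate(model, tokenizer, prompt=prompt, max_tokens=200))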
I haven't tried mlx-audio yet (which can synthesize speech) but it looks interesting too: https://github.com/Blaizzy/mlx-audio
The two best people to follow for MLX stuff are Apple's Awni Hannun - https://twitter.com/awnihannun and https://github.com/awni - and community member Prince Canuma who's responsible for both mlx-vlm and mlx-audio: https://twitter.com/Prince_Canuma and https://github.com/Blaizzy
Very cool to hear your perspective on how you're using small LLMs! I’ve been experimenting extensively with local LLM stacks on:
• M1 Max (MLX native)
• LM Studio (GLM, MLX, GGUFs)
• llama.cpp (GGUFs)
• n8n for orchestration + automation (multi-stage LLM workflows)
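When I want to prototype one of those multi-stage chains outside n8n, I just hit LM Studio's local OpenAI-compatible server from Python. A rough sketch; port 1234 is LM Studio's default, and the model name is whatever you happen to have loaded:

    from openai import OpenAI

    # LM Studio exposes an OpenAI-compatible API locally by default.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="local-model",  # LM Studio routes to whichever model is loaded
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # A two-stage chain: draft, then tighten.
    draft = ask("Draft a 3-sentence narration for a product teaser.")
    print(ask(f"Rewrite this for clarity and punch:\n\n{draft}"))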
My emerging use cases:
- Rapid narration scripting
- Roleplay agents with embedded prompt personas
- Reviewing image/video attachments + structuring copy for clarity
- Local RAG and eval pipelines (toy sketch below)
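And the RAG piece in miniature, with TF-IDF standing in for a real embedding index (same local-server assumption as above; the docs are dummies):

    from openai import OpenAI
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "MLX is Apple's array framework for machine learning on Apple Silicon.",
        "GGUF is the quantized model file format used by llama.cpp.",
        "n8n is a workflow automation tool with an HTTP request node.",
    ]
    question = "What file format does llama.cpp use?"

    # Toy retrieval: pick the most similar doc by TF-IDF cosine similarity.
    vec = TfidfVectorizer().fit(docs)
    sims = cosine_similarity(vec.transform([question]), vec.transform(docs))[0]
    context = docs[sims.argmax()]

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    resp = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content":
                   f"Answer using only this context:\n{context}\n\nQ: {question}"}],
    )
    print(resp.choices[0].message.content)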
My current lineup of small LLMs (this changes every month depending on what is updated):
MLX-native models (mlx-community):
- Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following
- InternVL3-8B-3bit → fast, memory-light, solid for doc summarization
- GLM-Z1-9B-bf16 → reliable multilingual output + inference density
GGUF via LM Studio / llama.cpp (loader sketch after this list):
- Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue
- Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once
- GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts
Emerging / niche models tested:
- MedFound-7B-GGUF → early tests for narrative medicine tasks
- X-Ray_Alpha-mlx-8Bit → experimental story/dialogue hybrid
- llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks
- PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)
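For anyone scripting the GGUF models rather than driving them through LM Studio, llama-cpp-python is one route. A minimal sketch; the model path is a placeholder, point it at any local GGUF file:

    from llama_cpp import Llama

    # Path is a placeholder; any local GGUF works.
    llm = Llama(model_path="models/gemma-3-12b-it-qat-Q4_K_M.gguf", n_ctx=4096)

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Give me one line of RP dialogue."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])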
I test most new LLMs on release. QAT (quantization-aware training) models in particular are showing promise, balancing speed + fidelity for chained inference. The meta-trend: models are getting better, smaller, and faster, especially for edge workflows.
Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.