I mainly use MLX for LLMs (with https://github.com/ml-explore/mlx-lm and my own https://github.com/simonw/llm-mlx which wraps that), vision LLMs (via https://github.com/Blaizzy/mlx-vlm) and running Whisper (https://github.com/ml-explore/mlx-examples/tree/main/whisper)
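If anyone wants to see how small the mlx-lm Python API is, here's a minimal sketch. The model repo is just an example from the mlx-community org on Hugging Face, and the exact generate() options have shifted between releases, so check the mlx-lm README for your installed version:

    # Minimal text generation with mlx-lm (pip install mlx-lm).
    # Model repo below is an example; most mlx-community models work the same way.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")

    prompt = "Write a one-sentence summary of what MLX is."
    # verbose=True streams tokens to stdout as they are generated
    text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
    print(text)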
I haven't tried mlx-audio yet (which can synthesize speech) but it looks interesting too: https://github.com/Blaizzy/mlx-audio
The two best people to follow for MLX stuff are Apple's Awni Hannun - https://twitter.com/awnihannun and https://github.com/awni - and community member Prince Canuma who's responsible for both mlx-vlm and mlx-audio: https://twitter.com/Prince_Canuma and https://github.com/Blaizzy
Very cool to hear your perspective on how you're using small LLMs! I’ve been experimenting extensively with local LLM stacks on:
• M1 Max (MLX native)
• LM Studio (GLM, MLX, GGUFs)
• llama.cpp (GGUFs)
• n8n for orchestration + automation (multi-stage LLM workflows)
My emerging use cases:
• Rapid narration scripting
• Roleplay agents with embedded prompt personas
• Reviewing image/video attachments + structuring copy for clarity
• Local RAG and eval pipelines
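On the multi-stage / chained-agent side: since LM Studio (and llama.cpp's server) expose an OpenAI-compatible endpoint locally, a two-stage chain is just two chat completions pointed at localhost. A rough sketch, assuming LM Studio's default port 1234 and the openai Python client; the model names and prompts are placeholders for whatever you have loaded:

    # Two-stage local "agent" chain against LM Studio's OpenAI-compatible server.
    # Assumes LM Studio is serving on its default port (1234); model names below
    # are placeholders for whatever models are actually loaded.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    def ask(model, system, user):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
        return resp.choices[0].message.content

    # Stage 1: a small, fast model drafts the narration script
    draft = ask("qwen2.5-0.5b-instruct", "You write terse narration scripts.",
                "Draft a 3-beat narration for a product demo of a note-taking app.")

    # Stage 2: a larger model reviews and tightens the draft
    final = ask("glm-4-32b", "You edit copy for clarity and flow.",
                f"Tighten this narration without changing its structure:\n\n{draft}")

    print(final)

n8n then just wraps calls like these as workflow nodes.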
My current lineup of small LLMs (this changes every month depending on what is updated):
MLX-native models (mlx-community):
-Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following
-InternVL3-8B-3bit → fast, memory-light, solid for doc summarization
-GLM-Z1-9B-bf16 → reliable multilingual output + inference density
GGUF via LM Studio / llama.cpp:
-Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue
-Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once
-GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts
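For scripting the GGUF models outside LM Studio's UI, I assume the llama-cpp-python bindings (or llama.cpp's own CLI) are the route; path, quant filename, and context size below are placeholders:

    # Loading a local GGUF directly with llama-cpp-python (pip install llama-cpp-python).
    # Model path is a placeholder; n_gpu_layers=-1 offloads all layers to Metal on Apple Silicon.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/gemma-3-12b-it-qat-Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,
        n_gpu_layers=-1,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Give me three hooks for a short story about a lighthouse."}]
    )
    print(out["choices"][0]["message"]["content"])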
Emerging / niche models tested:
-MedFound-7B-GGUF → early tests for narrative medicine tasks
-X-Ray_Alpha-mlx-8Bit → experimental story/dialogue hybrid
-llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks
-PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)
I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference. The meta-trend: models are getting better, smaller, faster, especially for edge workflows.
Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.