A) Embeddings.
B) Things like classification, structured outputs, image labelling etc.
C) Image generation.
D) LLM chatbot for answering questions, improving email drafts etc.
E) Agentic coding.
?
I have an MBP with an M1 Max and 32GB RAM. I can run a 20GB mlx_vlm model like mlx-community/Qwen3.5-35B-A3B-4bit. But:
- it's not very fast
- the context window is small
- it's not useful for agentic coding
I asked "What was Mary J. Blige's first album?" and it produced 332 tokens (mostly reasoning) before arriving at the correct answer.
mlx_vlm reported:
Prompt: 20 tokens @ 28.5 t/s | Generation: 332 tokens @ 56.0 t/s | Peak memory: 21.67 GB
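For scale, those reported numbers imply several seconds of wall-clock time for a one-line factual question, almost all of it spent generating reasoning tokens. A quick back-of-the-envelope check (using only the figures from the stats line above):

```python
# Rough end-to-end latency implied by the mlx_vlm stats above.
# All inputs are taken from the reported run; the split into prompt
# vs. generation time is an estimate, ignoring model-load overhead.
prompt_tokens, prompt_tps = 20, 28.5
gen_tokens, gen_tps = 332, 56.0

prompt_s = prompt_tokens / prompt_tps   # time to ingest the prompt
gen_s = gen_tokens / gen_tps            # time to generate the answer
total_s = prompt_s + gen_s

print(f"prompt {prompt_s:.1f}s + generation {gen_s:.1f}s = {total_s:.1f}s total")
```

So a trivial question costs about 6.6 seconds, dominated by generation of the reasoning trace, which is what makes the setup feel slow despite a respectable 56 t/s.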