Select a chip to see how fast it runs local LLMs at different model sizes. Token generation is memory-bandwidth-bound — higher bandwidth means faster output.
Data compiled from:
llama.cpp #4167 (Gerganov) •
GPU-Benchmarks-on-LLM-Inference (Dai) •
LLM Speed Comparison (Singh) •
Apple MLX M5 Research •
like2byte.com M4 Benchmarks •
Awesome Agents M4 Max •
Gemma 3 on M3 Ultra
All speeds are token generation (not prompt processing). Q4 quantization unless noted. Only exact measured numbers, no ranges or estimates.
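Because decoding streams every model weight from memory once per token, a rough ceiling on speed is bandwidth divided by model size. A minimal sketch of that back-of-envelope estimate (the function name, the 400 GB/s figure, and the 7B example are illustrative assumptions, not measured data from the sources above):

```python
def est_tokens_per_sec(bandwidth_gb_s: float, params_b: float,
                       bytes_per_param: float = 0.5) -> float:
    """Upper-bound decode speed in tokens/sec.

    Q4 quantization stores roughly 0.5 bytes per parameter, so a 7B
    model is ~3.5 GB of weights read once per generated token.
    """
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# Hypothetical example: a 400 GB/s chip running a 7B model at Q4.
# This is a theoretical ceiling; measured speeds come in lower.
print(round(est_tokens_per_sec(400, 7), 1))
```

Real numbers fall below this ceiling because of compute overhead and KV-cache reads, which is why the table sticks to measured figures.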