Select a chip to see how fast it runs local LLMs at different model sizes. Token generation is memory-bandwidth-bound — higher bandwidth means faster output.
Data compiled from:
llama.cpp #4167 (Gerganov) •
GPU-Benchmarks-on-LLM-Inference (Dai) •
LLM Speed Comparison (Singh) •
Apple MLX M5 Research •
like2byte.com M4 Benchmarks •
Awesome Agents M4 Max •
Gemma 3 on M3 Ultra
All speeds are token generation (not prompt processing). Q4 quantization unless noted. Only exact measured numbers, no ranges or estimates.
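Because decoding streams every model weight from memory once per token, a rough ceiling on speed is bandwidth divided by model size. A minimal sketch of that back-of-envelope estimate (the function name, the 400 GB/s figure, and the 7B example are illustrative assumptions, not measured data from the sources above):

```python
def est_tokens_per_sec(bandwidth_gb_s: float, params_b: float,
                       bytes_per_param: float = 0.5) -> float:
    """Upper-bound decode speed in tokens/sec.

    Q4 quantization stores roughly 0.5 bytes per parameter, so a 7B
    model is ~3.5 GB of weights read once per generated token.
    """
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# Hypothetical example: a 400 GB/s chip running a 7B model at Q4.
# This is a theoretical ceiling; measured speeds come in lower.
print(round(est_tokens_per_sec(400, 7), 1))
```

Real numbers fall below this ceiling because of compute overhead and KV-cache reads, which is why the table sticks to measured figures.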