A simple benchmark comparing the Qwen 3.5 4B model running on Apple MLX and Ollama on a base M1 iMac.
For this test, I ran the Qwen 3.5 4B model in two different environments: Apple's MLX framework using an 8-bit quantized model, and Ollama using a 4-bit quantized model.
Both tests used the same input and were executed on the same machine. The request contained a single prompt:
Hello!
Streaming was disabled, so each API call returned only after the entire response had been generated.
Test Environment
The benchmark was performed on the following system:
- Hardware: Apple iMac (M1 Chip, 8 Core CPU / 8 Core GPU, Unified Memory Architecture)
- Operating System: macOS
- Engine A (MLX): mlx_lm.server hosting mlx-community/Qwen3.5-4B-8bit (~5.2 GB)
- Engine B (Ollama): ollama hosting qwen3.5:4b (~3.4 GB, Q4_K_M quantization)
Test 1: MLX (8-bit)
The first test used Apple's MLX framework with the 8-bit version of the model.
Command used:
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "mlx-community/Qwen3.5-4B-8bit", "messages": [{"role": "user", "content": "Hello!"}]}' \
-w "\n| Total Time: %{time_total}s |\n"
Results
- Total Wall-Clock Time: 77.56 seconds
- Total Generated Tokens: 866 tokens
The response took over a minute to complete. Because streaming was disabled, the client did not receive any output until generation finished. The model generated 866 tokens before returning the final response.
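The post does not say how the token count was read. One possible approach, assuming the server returns an OpenAI-style usage object with a completion_tokens field in its JSON response, is to pull that number out of the saved response. The helper name completion_tokens below is hypothetical, and the sample JSON is a hand-written fragment, not real server output.

```shell
# Print the completion_tokens value from a chat-completions JSON response on stdin.
# Assumption: the response carries a usage object with a numeric completion_tokens field.
completion_tokens() {
  grep -o '"completion_tokens": *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Hand-written sample fragment (not real server output):
echo '{"usage": {"completion_tokens": 866}}' | completion_tokens
```

A dedicated JSON tool would be more robust than grep; this version only avoids extra dependencies.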
Test 2: Ollama (4-bit)
The second test used Ollama with the default 4-bit version of the same model.
Command used:
curl -X POST http://127.0.0.1:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3.5:4b", "messages": [{"role": "user", "content": "Hello!"}]}' \
-w "\n| Total Time: %{time_total}s |\n"
Results
- Total Wall-Clock Time: 23.61 seconds
- Total Generated Tokens: 392 tokens
The Ollama response completed in under a third of the time and contained fewer than half as many tokens as the MLX response.
Comparing the Results
| Metric | MLX (8-bit) | Ollama (4-bit) |
|---|---|---|
| Response Time | 77.56 s | 23.61 s |
| Generated Tokens | 866 | 392 |
| Model Size | ~5.2 GB | ~3.4 GB |
For this benchmark, Ollama completed the same request approximately 3.3× faster than MLX (77.56 s vs 23.61 s).
One noticeable difference was output length. The MLX run produced 866 tokens, about 2.2 times the 392 tokens produced by the Ollama run. Generating more tokens takes more time, so part of the latency gap reflects the longer response rather than engine speed alone.
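A rough way to separate output length from engine speed is to divide the generated tokens by the wall-clock time for each run. The sketch below does that with the figures from the results table. The function name tokens_per_second is hypothetical. Because the measured time also includes prompt processing and HTTP overhead, the results are approximate lower bounds on generation speed.

```shell
# Print tokens per second to two decimals.
# $1 = generated tokens, $2 = wall-clock seconds
tokens_per_second() {
  awk -v tokens="$1" -v seconds="$2" 'BEGIN { printf "%.2f\n", tokens / seconds }'
}

tokens_per_second 866 77.56   # MLX (8-bit)    -> 11.17
tokens_per_second 392 23.61   # Ollama (4-bit) -> 16.60
```

By this estimate, Ollama generated tokens roughly 1.5× faster than MLX in this run, a smaller gap than the wall-clock comparison suggests.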
Another difference was model size. The MLX model occupied approximately 5.2 GB on disk, while the Ollama model occupied approximately 3.4 GB.
Notes About This Test
This benchmark reflects a single prompt executed on a base M1 system. Results may vary depending on factors such as:
- Prompt complexity
- Model settings
- Quantization format
- Hardware configuration
- Engine implementation
- Whether streaming is enabled
The purpose of this test was not to determine which framework is universally better, but to compare how these two configurations behaved under the same conditions on the same machine.
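One way to reduce the effect of differing output lengths in a follow-up run would be to send both engines the same generation settings. The sketch below builds a request body with the OpenAI-style max_tokens and temperature fields. It assumes both servers honor these fields on their /v1/chat/completions endpoints. The function name capped_payload and the value 256 are illustrative only, and this was not part of the original test.

```shell
# Build a chat-completions request body with a token cap and temperature 0.
# $1 = model name, $2 = prompt text, $3 = maximum tokens to generate
# Note: the prompt is inserted as-is, so it must not contain quotes or backslashes.
capped_payload() {
  printf '{"model": "%s", "max_tokens": %d, "temperature": 0, "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$3" "$2"
}

capped_payload "qwen3.5:4b" "Hello!" 256
```

The printed body can be passed to curl with -d in place of the inline JSON used in the two tests. The output lengths only match if both runs actually reach the cap.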
Final Takeaways
- The Ollama 4-bit configuration responded more than three times faster in wall-clock time on this base M1 system.
- The Ollama run generated fewer tokens than the MLX run (392 vs 866).
- The model sizes were noticeably different (3.4 GB vs 5.2 GB).
- Streaming can improve the user experience by displaying output as it is generated rather than waiting for the full response.
To enable streaming, add the following field to your API request payload:
"stream": true
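As a sketch of what that looks like in practice, the helpers below rebuild the request from the tests above with streaming switched on. The function names stream_payload and stream_chat are hypothetical. curl's -N flag turns off output buffering so that chunks are printed as they arrive.

```shell
# Build the request body used in the tests above, with streaming enabled.
# $1 = model name, $2 = prompt text (inserted as-is, so avoid quotes and backslashes)
stream_payload() {
  printf '{"model": "%s", "stream": true, "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$2"
}

# Send the request to an OpenAI-compatible server.
# $1 = base URL, $2 = model name, $3 = prompt text
stream_chat() {
  curl -N -X POST "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(stream_payload "$2" "$3")"
}

# Example (requires the Ollama server from Test 2 to be running):
# stream_chat http://127.0.0.1:11434 qwen3.5:4b "Hello!"
```

Streaming changes when output becomes visible, not how long the full response takes to generate.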