Jun 9, 2026

Comparing MLX and Ollama on a Base M1 Mac

A simple benchmark comparing the Qwen 3.5 4B model running on Apple MLX and Ollama on a base M1 iMac.

For this test, I ran the Qwen 3.5 4B model in two different environments: Apple's MLX framework using an 8-bit quantized model, and Ollama using a 4-bit quantized model.

Both tests used the same input and were executed on the same machine. The request contained a single prompt:

Hello!

Streaming was disabled, so each API call returned only after the entire response had been generated.

Test Environment

The benchmark was performed on the following system:

  • Hardware: Apple iMac (M1 chip, 8-core CPU / 8-core GPU, unified memory architecture)
  • Operating System: macOS
  • Engine A (MLX): mlx_lm.server hosting mlx-community/Qwen3.5-4B-8bit (~5.2 GB)
  • Engine B (Ollama): ollama hosting qwen3.5:4b (~3.4 GB, Q4_K_M quantization)

Test 1: MLX (8-bit)

The first test used Apple's MLX framework with the 8-bit version of the model.

Command used:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3.5-4B-8bit", "messages": [{"role": "user", "content": "Hello!"}]}' \
  -w "\n| Total Time: %{time_total}s |\n"

Results

  • Total Wall-Clock Time: 77.56 seconds
  • Total Generated Tokens: 866 tokens

The response took over a minute to complete. Because streaming was disabled, the client received no output until all 866 tokens had been generated.

Test 2: Ollama (4-bit)

The second test used Ollama with the default 4-bit version of the same model.

Command used:

curl -X POST http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.5:4b", "messages": [{"role": "user", "content": "Hello!"}]}' \
  -w "\n| Total Time: %{time_total}s |\n"

Results

  • Total Wall-Clock Time: 23.61 seconds
  • Total Generated Tokens: 392 tokens

The Ollama response completed in under a third of the time (23.61 s versus 77.56 s) and contained fewer than half as many tokens (392 versus 866) as the MLX run.
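For repeated runs, the two curl commands can be reproduced in a short script. The following is a minimal Python sketch, not the method used for the numbers above. The helper names `build_request` and `timed_completion` are made up for illustration, and the code assumes both servers return an OpenAI-style `usage` object containing `completion_tokens`.

```python
import json
import time
import urllib.request


def build_request(base_url, model, prompt):
    """Build a non-streaming chat completion request like the curl commands."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_completion(base_url, model, prompt):
    """Send the request and return (wall_clock_seconds, completion_tokens)."""
    request = build_request(base_url, model, prompt)
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # Assumption: token counts are reported in an OpenAI-style "usage" object.
    # If the field is missing, tokens is None rather than a guessed value.
    tokens = body.get("usage", {}).get("completion_tokens")
    return elapsed, tokens


# Example calls (require the servers from the test environment to be running):
# timed_completion("http://127.0.0.1:8080", "mlx-community/Qwen3.5-4B-8bit", "Hello!")
# timed_completion("http://127.0.0.1:11434", "qwen3.5:4b", "Hello!")
```

Measuring on the client side, as `curl -w` does, means the time includes HTTP overhead and any model loading, not only generation.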

Comparing the Results

Metric             MLX (8-bit)   Ollama (4-bit)
Response Time      77.56 s       23.61 s
Generated Tokens   866           392
Model Size         ~5.2 GB       ~3.4 GB

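For this benchmark, Ollama completed the same request approximately 3.3× faster than MLX (23.61 s versus 77.56 s).

One noticeable difference was token generation. The MLX run produced 866 tokens, while the Ollama run produced 392 tokens, so part of the latency gap comes from the longer response. Dividing tokens by total time gives roughly 11.2 tokens/s for MLX and 16.6 tokens/s for Ollama, which puts Ollama about 1.5× ahead per token. These are end-to-end figures: the wall-clock time also includes prompt processing and any model loading, so they understate pure generation speed.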

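Another difference was model size. The MLX model occupied approximately 5.2 GB on disk, while the Ollama model occupied approximately 3.4 GB, which is expected when comparing 8-bit with 4-bit weights. Because the two runs used different quantization levels, the comparison covers engine and quantization together, not the engines alone.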
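To separate response length from speed, the two runs can be normalized to tokens per second. The sketch below uses only the figures reported in this post. The function name `tokens_per_second` is illustrative, and the rates are end-to-end values that include any load and prompt-processing time.

```python
def tokens_per_second(tokens, seconds):
    """End-to-end throughput: generated tokens divided by wall-clock time."""
    return tokens / seconds


# Figures taken from the two test runs in this post.
mlx_rate = tokens_per_second(866, 77.56)
ollama_rate = tokens_per_second(392, 23.61)

# How much sooner the Ollama response arrived.
latency_ratio = 77.56 / 23.61
# How much faster Ollama was per generated token.
throughput_ratio = ollama_rate / mlx_rate

print(f"MLX:    {mlx_rate:.1f} tokens/s")
print(f"Ollama: {ollama_rate:.1f} tokens/s")
print(f"Latency ratio:    {latency_ratio:.1f}x")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```

The latency ratio comes out near 3.3, while the per-token ratio is closer to 1.5, which shows how much of the gap is due to response length.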

Notes About This Test

This benchmark reflects a single prompt executed on a base M1 system. Results may vary depending on factors such as:

  • Prompt complexity
  • Model settings
  • Quantization format
  • Hardware configuration
  • Engine implementation
  • Whether streaming is enabled
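Some of these factors can be pinned in the request itself. As a hedged sketch, the payload below fixes the output length and sampling temperature using the standard OpenAI-style fields `max_tokens` and `temperature`. The function name `comparable_payload` and the default values are assumptions chosen for illustration; they were not part of the original test.

```python
import json


def comparable_payload(model, prompt, max_tokens=256, temperature=0.0):
    """Build a request body with generation settings held constant."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumed values: cap the response length so both engines
        # generate at most the same number of tokens.
        "max_tokens": max_tokens,
        # Assumed value: a temperature of 0 reduces run-to-run variation.
        "temperature": temperature,
        "stream": False,
    }


body = json.dumps(comparable_payload("qwen3.5:4b", "Hello!"))
print(body)
```

With the response length capped, differences in total time reflect the engine and quantization more than how long each model chose to talk.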

The purpose of this test was not to determine which framework is universally better, but to compare how these two configurations behaved when given the same prompt on the same machine.

Final Takeaways

  1. The Ollama 4-bit configuration responded in under a third of the time on this base M1 system (23.61 s vs 77.56 s).
  2. The Ollama run generated fewer tokens than the MLX run (392 vs 866).
  3. The model sizes were noticeably different (3.4 GB vs 5.2 GB).
  4. Streaming can improve the user experience by displaying output as it is generated rather than waiting for the full response.

To enable streaming, add the following field to your API request payload:

"stream": true
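With streaming enabled, the response arrives as a sequence of chunks instead of one JSON document. The following Python sketch assumes the server uses the OpenAI-style server-sent events format, where each line starts with `data:` and the stream ends with `data: [DONE]`. The names `stream_payload` and `iter_content` are hypothetical helpers, not part of either engine.

```python
import json


def stream_payload(model, prompt):
    """Request body with streaming turned on."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }


def iter_content(lines):
    """Yield text fragments from OpenAI-style streamed response lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Sample lines in the assumed format, standing in for a live response body.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "Hi"}}]}\n',
    b'data: {"choices": [{"delta": {"content": " there"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_content(sample)))
```

In a real client, the lines would come from iterating over the open HTTP response, and each fragment would be printed as soon as it is yielded.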
