🧑‍⚖️ Grader Comparison
Grader Leaderboard
Each local LLM was given a fresh copy of the rubric and graded all 50 responses. Results are compared against the Copilot baseline. Positive bias means the grader inflates; negative means it deflates.
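The bias figures quoted below can be reproduced with a simple signed-difference calculation; a minimal sketch, assuming a conventional A=4 … F=0 point mapping (the report does not state the actual scale):

```python
# Assumed letter-to-point mapping; the report's actual scale may differ.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def grader_bias(baseline, grader):
    """Mean signed difference between a grader's letters and the baseline's.

    Positive result = the grader inflates relative to the baseline;
    negative = it deflates.
    """
    diffs = [GRADE_POINTS[g] - GRADE_POINTS[b] for b, g in zip(baseline, grader)]
    return sum(diffs) / len(diffs)

# Example: upgrading one B to A across five responses yields a +0.2 bias.
print(grader_bias(["A", "B", "C", "F", "D"], ["A", "A", "C", "F", "D"]))
```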
Grade-by-Grade Grid
Baseline (Copilot) column first, then each grader LLM ranked by accuracy. Cell background shows disagreement severity: ±1 ±2 ±3 ±4–5
🏆 Top Grader Analysis – deepseek-r1:32b
Manual review of the highest-ranked grader against the Copilot baseline. 66% exact match, 90% within 1 letter.
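The exact-match and within-1-letter figures used throughout this comparison can be computed mechanically; a minimal sketch, assuming the five-letter A–F scale (no E) implied by the grade grid:

```python
# Index into this string gives a letter's numeric value: F=0 ... A=4.
SCALE = "FDCBA"

def agreement(baseline, grader):
    """Return (exact-match rate, within-1-letter rate) vs. the baseline."""
    exact = sum(b == g for b, g in zip(baseline, grader))
    within1 = sum(abs(SCALE.index(b) - SCALE.index(g)) <= 1
                  for b, g in zip(baseline, grader))
    n = len(baseline)
    return exact / n, within1 / n

# Example: one exact match, one off-by-one, one off-by-two.
exact_rate, within1_rate = agreement(["A", "B", "C"], ["A", "C", "F"])
```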
✅ Where it performs well
- Smoke & Grammar – near-perfect agreement; correctly identifies greeting failures and grammar errors
- Catastrophic failures – correctly assigns F to timeouts and clear instruction failures (e.g., tutorial instead of greeting)
- Logic near-misses – correctly recognises wrong answers, even if it softens F to D on partially-reasoned responses
❌ Consistent divergences
- Markdown fences in Coding – upgrades B→A on code that is correct but wrapped in fences, ignoring the "code only" instruction (5 mismatches)
- Runtime errors in code – gave an A to codestral's response that would throw IndentationError; reads intent rather than evaluating syntax
- Logic F→D softening – will not assign F even when the answer and reasoning are completely wrong, if the model made some attempt
- Project Orchestration – inconsistently applies the 5-element checklist; sometimes misses success criteria that are present, or vice versa
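The "reads intent rather than evaluating syntax" blind spot is mechanically avoidable: Python reports indentation errors at compile time, so a grading harness could gate the letter grade on a syntax check before any LLM judgment. A sketch (the `bad_code` snippet is a hypothetical stand-in for codestral's response, not the actual output):

```python
# Hypothetical stand-in for a response whose body is not indented.
bad_code = "def is_prime(n):\nreturn n > 1"

def compiles(src):
    """True if the source parses; IndentationError is a SyntaxError subclass,
    so compile() catches it without ever executing the code."""
    try:
        compile(src, "<response>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(bad_code))  # -> False
```

A harness could cap any non-compiling Coding response at a failing grade before the grader model is consulted.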
📊 Grader Analysis – deepseek-r1:14b
60% exact match, 84% within 1 letter, bias +0.58 – ranked #2 by exact match, but the highest bias of all top-tier graders.
✅ Where it performs well
- Smoke – correctly identifies greeting failures and passes clean responses
- Clean Grammar – agrees with baseline on responses with no meaning issues
- Clear Logic failures – does eventually penalise wrong answers, just rarely reaches F
❌ Consistent divergences
- Unrunnable code – worst blind spot of all graders: gave an A to codestral's IndentationError code, and D→B on phi3:mini's garbled, truncated, unclosed-paren mess; reads intent, ignores syntax
- Markdown fences – same as the 32b: ignores the "code only" rule and upgrades B→A (5 instances)
- phi3:mini over-graded across the board – Grammar meaning change missed (C→A), Logic complete failure softened (F→C), Project Orchestration step-count violation ignored (C→A)
- Logic F→D/C softening – never assigns F to a wrong answer if any reasoning was attempted
deepseek-r1:14b's +0.58 bias makes it notably more lenient than its 32b sibling. It is less reliable on nuanced cases and has a meaningful risk of rewarding broken code. Suitable for rough triage only, not as a trustworthy rubric enforcer.
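Since every grader so far waives the "code only" rule, that rule is better enforced deterministically before the LLM ever sees the response. A minimal sketch, assuming the violations look like responses wrapped in markdown fences (the heuristic is an assumption about how the responses were formatted):

```python
def violates_code_only(response: str) -> bool:
    """Flag a response that wraps its code in a markdown fence,
    breaching a "code only" instruction. A deliberately simple heuristic:
    any response beginning or ending with ``` is treated as fenced."""
    stripped = response.strip()
    return stripped.startswith("```") or stripped.endswith("```")

print(violates_code_only("```python\nprint('hi')\n```"))  # -> True
print(violates_code_only("print('hi')"))                  # -> False
```

Applying a fixed penalty when this check fires would remove the B→A fence inflation from all the graders at once.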
📊 Grader Analysis – phi4:14b
60% exact match, 88% within 1 letter, bias +0.32 – same exact-match rank as deepseek-r1:14b but notably better calibration and a stronger within-1 rate.
✅ Where it performs well
- Detects runtime errors in code – the only local grader to flag codestral's indentation error in its notes, earning a B instead of an A (still too lenient, but meaningfully better than the others)
- Smoke & Grammar – generally agrees with baseline on clear pass/fail cases
- Balanced Logic grading – F→D softening is present, but it does occasionally go stricter than baseline (qwen2.5-coder:7b D→F, a defensible call)
- Low bias for a 14b model – +0.32 matches deepseek-r1:32b, the best-ranked grader
❌ Consistent divergences
- Markdown fences in Coding – same blind spot as all r1 models: B→A when code is correct but wrapped in fences
- mistral/Coding D→B – misses the fundamental logic error (ignores the input list, generates all primes up to max); focuses on style issues instead
- phi3:mini/Project Orchestration C→A – ignores the 19+ step-count violation entirely
- Occasionally over-strict – mixtral/Grammar A→B and deepseek-r1:32b/Proj Orch A→B suggest slight over-criticism of well-formed responses, introducing noise in the other direction
phi4:14b is the most well-rounded local grader in the 14b class. Its code-error awareness is a genuine advantage over the r1 models. If deepseek-r1:32b is unavailable or too slow, phi4:14b is the recommended fallback, particularly given its VRAM efficiency on this machine.