🧑‍⚖️ Grader Comparison
Grader Leaderboard
Each local LLM was given a fresh copy of the rubric and graded all 50 responses. Results are compared against the Copilot baseline. Positive bias means the grader inflates; negative means it deflates.
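The bias figures quoted below can be reproduced with a simple signed-difference calculation; a minimal sketch, assuming a conventional A=4 … F=0 point mapping (the report does not state the actual scale):

```python
# Assumed letter-to-point mapping; the report's actual scale may differ.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def grader_bias(baseline, grader):
    """Mean signed difference between a grader's letters and the baseline's.

    Positive result = the grader inflates relative to the baseline;
    negative = it deflates.
    """
    diffs = [GRADE_POINTS[g] - GRADE_POINTS[b] for b, g in zip(baseline, grader)]
    return sum(diffs) / len(diffs)

# Example: upgrading one B to A across five responses yields a +0.2 bias.
print(grader_bias(["A", "B", "C", "F", "D"], ["A", "A", "C", "F", "D"]))
```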
Grade-by-Grade Grid
Baseline (Copilot) column first, then each grader LLM ranked by accuracy. Cell background shows disagreement severity: ±1 ±2 ±3 ±4–5
🏆 Top Grader Analysis – deepseek-r1:32b
Manual review of the highest-ranked grader against the Copilot baseline. 66% exact match, 90% within 1 letter.
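The exact-match and within-1-letter figures used throughout this comparison can be computed mechanically; a minimal sketch, assuming the five-letter A–F scale (no E) implied by the grade grid:

```python
# Index into this string gives a letter's numeric value: F=0 ... A=4.
SCALE = "FDCBA"

def agreement(baseline, grader):
    """Return (exact-match rate, within-1-letter rate) vs. the baseline."""
    exact = sum(b == g for b, g in zip(baseline, grader))
    within1 = sum(abs(SCALE.index(b) - SCALE.index(g)) <= 1
                  for b, g in zip(baseline, grader))
    n = len(baseline)
    return exact / n, within1 / n

# Example: one exact match, one off-by-one, one off-by-two.
exact_rate, within1_rate = agreement(["A", "B", "C"], ["A", "C", "F"])
```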
✅ Where it performs well
- Smoke & Grammar – near-perfect agreement; correctly identifies greeting failures and grammar errors
- Catastrophic failures – correctly assigns F to timeouts and clear instruction failures (e.g., tutorial instead of greeting)
- Logic near-misses – correctly recognises wrong answers, even if it softens F to D on partially-reasoned responses
❌ Consistent divergences
- Markdown fences in Coding – upgrades B→A on code that is correct but wrapped in fences, ignoring the "code only" instruction (5 mismatches)
- Runtime errors in code – gave an A to codestral's response that would throw IndentationError; reads intent rather than evaluating syntax
- Logic F→D softening – will not assign F even when the answer and reasoning are completely wrong, if the model made some attempt
- Project Orchestration – inconsistently applies the 5-element checklist; sometimes misses success criteria that are present, or vice versa
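The "reads intent rather than evaluating syntax" blind spot is mechanically avoidable: Python reports indentation errors at compile time, so a grading harness could gate the letter grade on a syntax check before any LLM judgment. A sketch (the `bad_code` snippet is a hypothetical stand-in for codestral's response, not the actual output):

```python
# Hypothetical stand-in for a response whose body is not indented.
bad_code = "def is_prime(n):\nreturn n > 1"

def compiles(src):
    """True if the source parses; IndentationError is a SyntaxError subclass,
    so compile() catches it without ever executing the code."""
    try:
        compile(src, "<response>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(bad_code))  # -> False
```

A harness could cap any non-compiling Coding response at a failing grade before the grader model is consulted.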
📊 Grader Analysis – deepseek-r1:14b
60% exact match, 84% within 1 letter, bias +0.58 – ranked #2 by exact match, but the highest bias of all top-tier graders.
✅ Where it performs well
- Smoke – correctly identifies greeting failures and passes clean responses
- Clean Grammar – agrees with baseline on responses with no meaning issues
- Clear Logic failures – does eventually penalise wrong answers, just rarely reaches F
❌ Consistent divergences
- Unrunnable code – worst blind spot of all graders: gave an A to codestral's IndentationError code, and D→B on phi3:mini's garbled, truncated, unclosed-paren mess; reads intent, ignores syntax
- Markdown fences – same as the 32b: ignores the "code only" rule and upgrades B→A (5 instances)
- phi3:mini over-graded across the board – Grammar meaning change missed (C→A), Logic complete failure softened (F→C), Project Orchestration step-count violation ignored (C→A)
- Logic F→D/C softening – never assigns F to a wrong answer if any reasoning was attempted
deepseek-r1:14b's +0.58 bias makes it notably more lenient than its 32b sibling. It is less reliable on nuanced cases and has a meaningful risk of rewarding broken code. Suitable for rough triage only, not as a trustworthy rubric enforcer.
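Since every grader so far waives the "code only" rule, that rule is better enforced deterministically before the LLM ever sees the response. A minimal sketch, assuming the violations look like responses wrapped in markdown fences (the heuristic is an assumption about how the responses were formatted):

```python
def violates_code_only(response: str) -> bool:
    """Flag a response that wraps its code in a markdown fence,
    breaching a "code only" instruction. A deliberately simple heuristic:
    any response beginning or ending with ``` is treated as fenced."""
    stripped = response.strip()
    return stripped.startswith("```") or stripped.endswith("```")

print(violates_code_only("```python\nprint('hi')\n```"))  # -> True
print(violates_code_only("print('hi')"))                  # -> False
```

Applying a fixed penalty when this check fires would remove the B→A fence inflation from all the graders at once.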
📊 Grader Analysis – phi4:14b
60% exact match, 88% within 1 letter, bias +0.32 – same exact-match rank as deepseek-r1:14b but notably better calibration and a stronger within-1 rate.
✅ Where it performs well
- Detects runtime errors in code – the only local grader to flag codestral's indentation error in its notes, earning a B instead of an A (still too lenient, but meaningfully better than the others)
- Smoke & Grammar – generally agrees with baseline on clear pass/fail cases
- Balanced Logic grading – F→D softening is present, but it does occasionally go stricter than baseline (qwen2.5-coder:7b D→F, a defensible call)
- Low bias for a 14b model – +0.32 matches deepseek-r1:32b, the best-ranked grader
❌ Consistent divergences
- Markdown fences in Coding – same blind spot as all r1 models: B→A when code is correct but wrapped in fences
- mistral/Coding D→B – misses the fundamental logic error (ignores the input list, generates all primes up to max); focuses on style issues instead
- phi3:mini/Project Orchestration C→A – ignores the 19+ step-count violation entirely
- Occasionally over-strict – mixtral/Grammar A→B and deepseek-r1:32b/Proj Orch A→B suggest slight over-criticism of well-formed responses, introducing noise in the other direction
phi4:14b is the most well-rounded local grader in the 14b class. Its code-error awareness is a genuine advantage over the r1 models. If deepseek-r1:32b is unavailable or too slow, phi4:14b is the recommended fallback, particularly given its VRAM efficiency on this machine.