December 2025 Research Report

State of LLMs in Grading & Feedback

Can AI reliably grade essays and provide feedback? We tested 6 leading models on 1,500+ student essays to find out what works, what doesn't, and what educators should know.

GPT, Claude, Gemini, Grok
6 Models Tested
1,500+ Essays
5 Experiments
~15,000 Evaluations

By the GradingPal Research Team

01

Research Overview

02

Key Takeaways for Educators

Include 2-3 Graded Examples in Your Prompt

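When asking AI to grade, provide a few example essays with your scores and reasoning. Few-shot prompting raised average QWK from 0.672 to 0.716 (about 6.5% relative), though gains varied by model: +12.5% for Claude 4.5, the largest, but only +0.6% for GPT-5.1.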

QWK: 0.716 vs 0.672 baseline. See Prompt Engineering Strategy.

Use AI for First-Pass, Spot-Check Borderlines

Let AI grade the full stack, then manually review essays near grade cutoffs. AI is within 1 point 90%+ of the time—excellent for triage, but edge cases need your expertise.

91-94% within ±1 point. See Holistic Essay Scoring.
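The triage workflow above can be sketched in a few lines. This is a minimal illustration, not the study's tooling: `ScoredEssay`, `is_borderline`, `split_for_review`, and the idea of a single passing `cutoff` are all assumptions made for the example. An essay is flagged when a one-point AI error could flip it across the cutoff.

```python
from dataclasses import dataclass


@dataclass
class ScoredEssay:
    essay_id: str
    ai_score: int  # holistic AI score on the report's 1-6 scale


def is_borderline(ai_score: int, cutoff: int, tolerance: int = 1) -> bool:
    """True if an error of up to `tolerance` points could flip pass/fail.

    `cutoff` is the lowest passing score (a hypothetical classroom policy).
    """
    passes = ai_score >= cutoff
    could_drop_below = ai_score - tolerance < cutoff
    could_reach_cutoff = ai_score + tolerance >= cutoff
    return (passes and could_drop_below) or (not passes and could_reach_cutoff)


def split_for_review(essays, cutoff, tolerance=1):
    """Return (accepted, needs_review) lists of essays."""
    accepted, needs_review = [], []
    for essay in essays:
        if is_borderline(essay.ai_score, cutoff, tolerance):
            needs_review.append(essay)
        else:
            accepted.append(essay)
    return accepted, needs_review


if __name__ == "__main__":
    stack = [ScoredEssay("e1", 2), ScoredEssay("e2", 3),
             ScoredEssay("e3", 4), ScoredEssay("e4", 6)]
    accepted, needs_review = split_for_review(stack, cutoff=4)
    print("Spot-check these:", [e.essay_id for e in needs_review])
```

With a cutoff of 4 and a one-point tolerance, only scores of 3 and 4 are routed to manual review; everything else is accepted as a first pass.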

Ask AI to Explain Why Suggestions Matter

Add to your prompt: 'For each suggestion, explain why it improves the writing.' All models scored lowest on teaching the 'why' behind their feedback.

Pedagogical Value: 4.1/5 (weakest). See Feedback Quality.

Trust Overall Scores, Double-Check Style Comments

Accept holistic grades confidently, but manually verify feedback about voice, style, or word choice. AI excels at structure but struggles with nuanced traits.

Holistic: 0.69 vs Traits: 0.37. See Multi-Trait Scoring.

What Accuracy Should You Expect?

Exact Match Rate: 38-48%

AI will exactly match the human score on about 4 to 5 of every 10 essays.

Within ±1 Point: 88-94%

AI is within one point of human scores 9 out of 10 times—comparable to human inter-rater reliability.
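To check these two rates on your own class set, a short sketch like the following may help. The function names and the sample scores are illustrative assumptions, not data from the study.

```python
from typing import Sequence


def _check(human: Sequence[int], ai: Sequence[int]) -> None:
    if len(human) != len(ai) or len(human) == 0:
        raise ValueError("score lists must be non-empty and the same length")


def exact_match_rate(human: Sequence[int], ai: Sequence[int]) -> float:
    """Share of essays where the AI score equals the human score."""
    _check(human, ai)
    return sum(h == a for h, a in zip(human, ai)) / len(human)


def within_tolerance_rate(human: Sequence[int], ai: Sequence[int],
                          tolerance: int = 1) -> float:
    """Share of essays where the AI score is within `tolerance` points."""
    _check(human, ai)
    return sum(abs(h - a) <= tolerance for h, a in zip(human, ai)) / len(human)


if __name__ == "__main__":
    # Made-up scores for ten essays on a 1-6 scale.
    human_scores = [3, 4, 2, 5, 4, 3, 6, 2, 4, 3]
    ai_scores = [3, 3, 2, 4, 4, 4, 4, 2, 5, 3]
    print("exact match:", exact_match_rate(human_scores, ai_scores))
    print("within 1:   ", within_tolerance_rate(human_scores, ai_scores))
```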

03 Experiment 1

Holistic Essay Scoring

Research Question

Can AI models reliably assign a single holistic score to student essays that aligns with expert human judgment?

Methodology

1. Dataset: 500 essays from the PERSUADE competition with human expert scores (1-6 scale)

2. Models Tested: GPT-5.1, GPT-5.1 Mini, Claude 4.5, Gemini 3 Pro, Gemini 3 Flash, Grok 4.1

3. Evaluation: QWK (Quadratic Weighted Kappa), MAE (Mean Absolute Error), Exact Match Rate
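For readers unfamiliar with QWK, here is a minimal pure-Python sketch of the standard quadratic weighted kappa and MAE definitions. It is not the study's evaluation code; the function names and the default 1-6 score range are assumptions for illustration. QWK penalizes disagreements by the square of their distance, so a two-point miss costs four times as much as a one-point miss.

```python
from typing import Sequence


def quadratic_weighted_kappa(human: Sequence[int], ai: Sequence[int],
                             min_score: int = 1, max_score: int = 6) -> float:
    """Standard QWK: 1 - (weighted observed disagreement / weighted expected)."""
    if len(human) != len(ai) or len(human) == 0:
        raise ValueError("score lists must be non-empty and the same length")
    n = max_score - min_score + 1
    total = len(human)

    # Confusion matrix: rows are human scores, columns are AI scores.
    observed = [[0] * n for _ in range(n)]
    for h, a in zip(human, ai):
        observed[h - min_score][a - min_score] += 1

    human_hist = [sum(row) for row in observed]
    ai_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = human_hist[i] * ai_hist[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - numerator / denominator


def mean_absolute_error(human: Sequence[int], ai: Sequence[int]) -> float:
    """Average size of the gap between AI and human scores, in points."""
    if len(human) != len(ai) or len(human) == 0:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(abs(h - a) for h, a in zip(human, ai)) / len(human)
```

A QWK of 1.0 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.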

Model Rankings
By Quadratic Weighted Kappa (QWK)
1. Grok 4.1: 0.689 (Best)
2. Gemini 3 Pro: 0.68
3. Claude 4.5: 0.665
4. Gemini 3 Flash: 0.59
5. GPT-5.1: 0.521
Holistic Scoring Performance
Comparing models across key metrics
Real Examples from Our Study
See how AI scored actual student essays compared to human experts
11th Grade

High Performer with Strong Voice

Distance Learning Benefits

Human Score: 6/6

"I hate coming to school" is something that I hear from my friends very often. Most teenagers go to a high school. Some find it very difficult and are searching for an alternative... Although some may believe otherwise, students would benefit from attending classes from home because school can be extremely stressful for students with poor mental health...

Grok 4.1: 5/6 (-1)
Claude 4.5: 4/6 (-2)
Gemini 3 Flash: 4/6 (-2)
Insight

AI undervalued authentic student voice and personal anecdotes, favoring research-style citations over compelling personal examples.

Teacher Tip

Trust your instincts when AI underscores essays with strong personal narratives.

What This Means for Educators

Trust

AI holistic scores are reliable for initial screening and identifying essays that need attention.

Review

Always review essays where the AI score differs significantly from your expectation, since AI may miss authentic voice.

04 Experiment 2

Multi-Trait Scoring

Research Question

Can AI provide accurate scores for specific writing traits like cohesion, grammar, and vocabulary?

AI Performance by Writing Trait
From easiest to hardest for AI to assess accurately

Cohesion (how ideas connect and flow): 0.529, Easiest
Conventions (spelling, punctuation, capitalization): 0.435, Easier
Vocabulary (word choice appropriateness): 0.4, Medium
Grammar (usage and agreement): 0.39, Medium
Syntax: 0.363, Harder
Phraseology (phrases, idioms, collocations): 0.311, Hardest
Trait Assessment Examples
See how AI assesses specific writing traits compared to human experts
Easiest for AI

Cohesion Assessment

AI: 4.5
Human: 4.5

First of all, students with poor mental health would greatly benefit... Furthermore, online classes provide flexibility... In conclusion, distance learning offers many advantages.

AI Assessment

Strong transitions ('First of all', 'Furthermore', 'In conclusion') create clear logical flow between paragraphs.

Human Assessment

Excellent paragraph structure with consistent topic sentences and smooth transitions.

Why AI Excels

AI excels at detecting structural patterns like transitions, topic sentences, and logical progression.

Model Performance Summary
Average QWK across all traits
1. Gemini 3 Flash: 0.37 (Best Overall)
2. GPT-5.1: 0.348
3. GPT-5.1 Mini: 0.308
4. Grok 4.1: 0.264

What This Means for Educators

Use Confidently

AI trait feedback on cohesion and conventions is reliable and can save significant grading time.

Review Carefully

AI struggles with nuanced traits like phraseology—add your own feedback on style and voice.

05 Experiment 3

Feedback Quality

Research Question

Is AI-generated feedback actually helpful for students? Which models provide the most pedagogically valuable comments?

Compare AI Feedback Styles
See how each model approaches student feedback differently
Best for Teaching

Gemini's Teaching Approach

Strengths

You have gathered relevant evidence from various global contexts, such as the statistics regarding Vauban, Germany, and the smog levels in Paris.

Areas for Improvement

Your essay lacks a clear thesis statement. In the first paragraph, you ask questions but do not state a firm position. Try: "While cars offer convenience, cities should transition to car-free models to combat pollution."

Next Steps

1. Proofread for spelling errors like "gicing" and "partical." 2. Add a clear thesis statement at the end of your introduction.

Provides model sentences to emulate
Explains WHY suggestions matter
Teaches writing concepts
Feedback Quality Scores
Evaluated by a panel of 3 LLM judges across 5 dimensions (scale: 1-5)
Models shown: GPT-5.1, Gemini 3 Flash, Claude 4.5
Feedback Quality Leaderboard
Overall scores across all evaluation dimensions
1. GPT-5.1: 4.71 (Best)
2. Gemini 3 Flash: 4.69
3. GPT-5.1 Mini: 4.65
4. Claude 4.5: 4.59
5. Grok 4.1: 4.57

What This Means for Educators

Teaching

Use Gemini for feedback that teaches writing concepts with model sentences students can emulate.

Confidence

Use GPT for encouraging feedback that builds student confidence while still addressing issues.

Rigor

Use Claude for academic feedback that prepares students for college-level writing expectations.

06 Experiment 4

Consistency & Reliability

Research Question

Can you trust AI to give the same score twice? How consistent are these models when grading identical essays, paraphrased content, or under different settings?

Identical Test

Same essay submitted 5 times. Does the AI give the same score every time?

1,440 tests

Semantic Test

Original vs paraphrased versions. Are scores based on ideas or word choice?

576 pairs

Temperature Test

How does the "randomness" setting affect score stability?

711 tests
Test 1: Identical Resubmission
If you submit the same essay 5 times, will you get the same score? Higher is better.

Key Finding

Gemini models achieve 100% exact match rate—they're perfectly deterministic. All models stay within 1 point, so even "inconsistent" models won't wildly misgrade.
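One plausible way to measure this on your own essays is sketched below. The report does not spell out its exact formula, so treat the definition here (an essay counts as consistent only if every run returns the same score) and the name `resubmission_consistency` as assumptions.

```python
from typing import Dict, List, Tuple


def resubmission_consistency(runs: Dict[str, List[int]]) -> Tuple[float, int]:
    """Summarize repeated grading of identical essays.

    `runs` maps an essay id to the scores from each resubmission.
    Returns (share of essays with identical scores on every run,
             largest gap between any two scores for the same essay).
    """
    if not runs or any(len(scores) == 0 for scores in runs.values()):
        raise ValueError("every essay needs at least one score")
    consistent = sum(len(set(scores)) == 1 for scores in runs.values())
    largest_gap = max(max(scores) - min(scores) for scores in runs.values())
    return consistent / len(runs), largest_gap


if __name__ == "__main__":
    # Made-up scores: five submissions each of two essays.
    example = {"essay_a": [3, 3, 3, 3, 3], "essay_b": [4, 4, 5, 4, 4]}
    rate, gap = resubmission_consistency(example)
    print(f"consistent essays: {rate:.0%}, largest gap: {gap} point(s)")
```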

Test 2: Semantic Equivalence
Does the AI grade the ideas or the specific words? Original vs paraphrased content compared.

Great News for ELL Students

All models maintain scores within 1 point when the same ideas are expressed differently. This means AI evaluates the quality of thinking, not just vocabulary sophistication.
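A paraphrase check of this kind can be sketched as follows. The function name, the returned keys, and the sample pairs are illustrative assumptions. The signed mean shift is included because a consistently negative value would suggest the grader rewards the original wording over the ideas.

```python
from typing import Dict, List, Tuple


def paraphrase_stability(pairs: List[Tuple[int, int]]) -> Dict[str, float]:
    """Compare (original_score, paraphrase_score) pairs for the same ideas."""
    if not pairs:
        raise ValueError("need at least one score pair")
    count = len(pairs)
    same = sum(orig == para for orig, para in pairs)
    within_one = sum(abs(orig - para) <= 1 for orig, para in pairs)
    shift = sum(para - orig for orig, para in pairs)
    return {
        "same_score_rate": same / count,
        "within_one_rate": within_one / count,
        "mean_shift": shift / count,  # paraphrase minus original, in points
    }


if __name__ == "__main__":
    # Made-up scores for four original/paraphrase pairs.
    print(paraphrase_stability([(4, 4), (3, 4), (5, 4), (2, 2)]))
```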

Test 3: Temperature Sensitivity
Higher temperature = more randomness. Lower variance is better for grading consistency.

Critical for Administrators

Always use temperature 0 for grading. At temperature 1.0, some models (especially Grok) show significant variance. Claude remains most stable across all temperature settings.
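If you want to verify this for the model you use, one simple approach is to grade the same essay several times at each temperature and compare the spread. The sketch below assumes you have already collected those scores; `variance_by_temperature` and the sample numbers are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def variance_by_temperature(scores: Dict[float, List[int]]) -> Dict[float, float]:
    """Population variance of repeated scores for one essay, per temperature.

    `scores` maps a temperature setting to the scores returned at that setting.
    Lower variance means more stable grading.
    """
    return {temp: float(pvariance(runs)) for temp, runs in sorted(scores.items())}


if __name__ == "__main__":
    # Made-up scores: four runs of one essay at two temperature settings.
    observed = {0.0: [3, 3, 3, 3], 1.0: [2, 3, 4, 3]}
    for temp, var in variance_by_temperature(observed).items():
        print(f"temperature {temp}: variance {var:.2f}")
```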

Overall Consistency Ranking
Composite score combining all three consistency tests
1. Gemini 3 Flash: 91.4%
2. Claude 4.5: 84.9%
3. Gemini 3 Pro: 83.4%
4. GPT-5.1 Mini: 79.2%
5. GPT-5.1: 71.2%
6. Grok 4.1: 55.9%
Consistency in Action
See real examples of how consistency varies across models and tests
Identical Resubmission

Gemini 3 Flash: Perfect Consistency

Same essay submitted 5 times, same score every time

100% Match Rate

"The advantages of limiting car usage are numerous. In Vauban, Germany, residents have given up their cars..."

5 Submissions: 3, 3, 3, 3, 3

Gemini models produce identical scores when given the same essay multiple times, making them highly predictable for grading workflows.

Teacher Tip: If you need consistent, reproducible grades for record-keeping or appeals, Gemini models won't surprise you with score variations.

What This Means for Educators

For High-Stakes Assessments: Use Gemini models for maximum reproducibility—parents and administrators can't argue with 100% consistency.

For ELL Students: All models fairly evaluate ideas over vocabulary, with 75-94% same-score rates on paraphrased content.

Configuration Matters: Always set temperature to 0 (or lowest available). Higher temps are designed for creative tasks, not evaluation.

Appeals Process: If a student resubmits for regrading, document which model and settings were used. Gemini will give identical results; others may vary by ±1 point.
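The record-keeping advice above could look something like this in practice. It is a sketch only: the `GradingRecord` fields and the `make_record` helper are assumptions about what a school might choose to log, and the essay is stored as a hash so the record can be matched to a submission without copying student text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GradingRecord:
    essay_sha256: str     # fingerprint of the exact text that was graded
    model: str            # model name as shown by your provider
    temperature: float    # the report recommends 0 for grading
    prompt_strategy: str  # for example "zero-shot" or "few-shot"
    score: int
    graded_at: str        # UTC timestamp in ISO 8601 format


def make_record(essay_text: str, model: str, temperature: float,
                prompt_strategy: str, score: int) -> GradingRecord:
    """Build an audit record for one grading run."""
    digest = hashlib.sha256(essay_text.encode("utf-8")).hexdigest()
    graded_at = datetime.now(timezone.utc).isoformat()
    return GradingRecord(digest, model, temperature, prompt_strategy, score, graded_at)


if __name__ == "__main__":
    record = make_record("The advantages of limiting car usage are numerous.",
                         model="Gemini 3 Flash", temperature=0.0,
                         prompt_strategy="few-shot", score=3)
    print(json.dumps(asdict(record), indent=2))
```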

07 Experiment 5

Prompt Engineering Strategy

Research Question

Does how you phrase the prompt matter? We tested four strategies—zero-shot (just rubric), few-shot (rubric + examples), chain-of-thought (explicit reasoning), and reasoning mode (native AI thinking)—to find the optimal approach.

Zero-Shot

Just the rubric, no examples. "Score this essay based on these criteria."

Baseline approach

Few-Shot

Rubric + 3 scored exemplar essays. "Here are examples, now score this one."

🏆 Best overall strategy

Chain-of-Thought

Step-by-step reasoning. "Analyze each criterion, then provide final score."

Similar to zero-shot

Reasoning Mode

Native AI thinking. The model uses its built-in reasoning capabilities.

🔮 Promising for newer models
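To make the few-shot strategy concrete, here is a minimal sketch of assembling a rubric plus scored exemplars into a single prompt string. The wording of the prompt, the `Exemplar` class, and `build_few_shot_prompt` are assumptions for illustration; the exact prompts used in the study are not shown in this report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    essay: str
    score: int
    reasoning: str  # the teacher's short justification for the score


def build_few_shot_prompt(rubric: str, exemplars: List[Exemplar], essay: str) -> str:
    """Combine the rubric, scored examples, and the essay to grade."""
    parts = [
        "You are grading student essays on a 1-6 holistic scale.",
        "",
        "Rubric:",
        rubric,
        "",
    ]
    for number, example in enumerate(exemplars, start=1):
        parts += [
            f"Example {number}",
            f"Essay: {example.essay}",
            f"Score: {example.score}",
            f"Reasoning: {example.reasoning}",
            "",
        ]
    parts += ["Now score this essay.", f"Essay: {essay}", "Score:"]
    return "\n".join(parts)


if __name__ == "__main__":
    # Placeholder exemplars; replace with real essays you have already scored.
    samples = [
        Exemplar("(low-scoring essay text)", 2, "No clear position; few details."),
        Exemplar("(mid-scoring essay text)", 4, "Clear position; uneven support."),
        Exemplar("(high-scoring essay text)", 6, "Strong thesis and evidence."),
    ]
    print(build_few_shot_prompt("(your rubric text)", samples, "(essay to grade)"))
```

Dropping the exemplar list turns the same function into a zero-shot prompt, which makes it easy to compare the two strategies on a handful of essays you have already graded.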
Strategy Comparison by Model
QWK scores across all four prompting strategies. Green bars (Few-Shot) provide the best overall results.

Key Finding

Few-Shot wins for overall QWK scores. The red dashed line shows the 0.70 QWK threshold—most models exceed it with Few-Shot prompting. Note: Reasoning Mode shows promise for Gemini Pro (see below).

Few-Shot Improvement Over Zero-Shot
How much does adding 3 examples help? Some models benefit dramatically.
Claude: +12.5%

Claude shows the largest improvement from examples. Always use few-shot with Claude.

GPT-5.1: +0.6%

GPT is already well-calibrated. You can save tokens by using zero-shot if budget is tight.

Reasoning Mode: A Promising Direction
Native AI reasoning shows strong precision for the newest models—a sign of where AI grading is headed.
Gemini 3 Pro + Reasoning Mode
Exact Match: 46.5%
Within ±1: 96.5%

The highest exact-match rate of any model-strategy combination—when it's right, it nails it perfectly.

What This Means for Educators

As AI models improve their native reasoning capabilities, we expect reasoning mode to become increasingly effective for grading.

Best Performance Ranking (Few-Shot)
Final model ranking when using the optimal prompting strategy
1. Gemini 3 Pro: 0.800 QWK
2. Grok 4.1: 0.797 QWK
3. Claude 4.5: 0.793 QWK
4. GPT-5.1: 0.722 QWK
5. Gemini 3 Flash: 0.666 QWK
6. GPT-5.1 Mini: 0.518 QWK

Top 3 Models Cluster Together

Gemini Pro (0.800), Grok (0.797), and Claude (0.793) all achieve nearly identical accuracy. This suggests a ~0.80 QWK ceiling for current LLMs on this grading task.

Prompting Strategies in Action
See how different prompting approaches affect actual grading decisions
Zero-Shot vs Few-Shot

Claude 4.5: Few-Shot Learning Success

Providing 3 scored examples dramatically improved Claude's accuracy

+12.5% Improvement
Zero-Shot Score: 2
Few-Shot Score: 3
Human Score: 3

With zero-shot, Claude underscored this borderline essay. After seeing 3 calibrated examples, it correctly identified this as meeting the score 3 threshold.

Teacher Tip: If you're using Claude for grading, always include 2-3 scored sample essays in your prompt. It's worth the extra tokens for a +12.5% accuracy boost.

What This Means for Educators

Always Use Examples

Include 2-3 scored sample essays in your prompt. This single change can boost accuracy by up to 8.7 percentage points.

Skip Elaborate Prompts

Chain-of-thought and complex reasoning instructions don't help—they cost more and often perform worse.

Choose Your Examples Carefully

Select exemplars that represent each score level clearly. Borderline essays as examples can confuse the model.

Top Models Are Interchangeable

Gemini Pro, Grok, and Claude achieve nearly identical accuracy. Choose based on cost, speed, or other features.

08

For Your Classroom

By Grade Level
Grades 4-6

Use AI for basic mechanics feedback (spelling, punctuation). Review all holistic scores—AI may not recognize developing voice.

Grades 7-9

AI trait feedback on cohesion and conventions is reliable. Use GPT for encouraging tone with struggling writers.

Grades 10-12

Full AI grading is appropriate for most assignments. Use Claude for AP/college prep students who need academic rigor.

By Goal
Build Confidence

Use GPT: warm, validating tone with positive framing of errors

Teach Writing Principles

Use Gemini: explains WHY and provides model sentences

College Preparation

Use Claude: academic tone with sophisticated vocabulary

Quick Screening

Use Grok: best holistic scoring accuracy for initial review
