# Experiment 3: Feedback Quality Analysis Report

**Date**: December 24, 2024  
**Dataset**: PERSUADE 2.0 (48 essays, stratified sample)  
**Evaluations**: 402 total judge evaluations (5 feedback models, 3 judge models)

---

## Executive Summary

This experiment evaluated the **quality of AI-generated feedback** using an LLM-as-judge methodology. Five models generated feedback on student essays, which was then evaluated by three different judge models across five dimensions.

### Key Findings

| Rank | Model | Overall Score | Best Dimension |
|------|-------|---------------|----------------|
| 1 | gpt-5.1 | **4.71** | Specificity (4.96) |
| 2 | gemini-3-flash-preview | **4.69** | Specificity (4.91) |
| 3 | gpt-5.1-mini | **4.65** | Specificity (4.99) |
| 4 | claude-4.5-sonnet | **4.59** | Accuracy (4.89) |
| 5 | grok-4.1 | **4.57** | Tone (4.94) |

**Top score**: gpt-5.1 at **4.71/5.0**, narrowly ahead of gemini-3-flash-preview (4.69). All five models fall within 0.14 points of each other.

---

## Methodology

### Phase 1: Feedback Generation
- **Prompt**: Structured feedback with 150-200 word target
- **Structure**: Strengths → Areas for Improvement → Next Steps
- **Models**: 5 models generated feedback for 48 essays each
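
The prompt constraints above (three sections in a fixed order, 150-200 words) lend themselves to an automated check before feedback is sent to judges. The following is a minimal sketch, not the pipeline's actual validator: `check_feedback` and its return keys are hypothetical names, and it assumes the section labels appear literally in the text, as they do in the excerpts in this report.

```python
import re

# Section labels in the order the prompt asks for them.
SECTION_ORDER = ["STRENGTHS", "AREAS FOR IMPROVEMENT", "NEXT STEPS"]


def check_feedback(text: str, min_words: int = 150, max_words: int = 200) -> dict:
    """Report whether feedback has the three sections in order and fits the word target.

    Hypothetical helper for illustration only.
    """
    upper = text.upper()
    # Position of the first occurrence of each label (-1 if missing).
    positions = [upper.find(name) for name in SECTION_ORDER]
    has_all = all(p >= 0 for p in positions)
    in_order = has_all and positions == sorted(positions)
    # Count whitespace-separated tokens as words.
    n_words = len(re.findall(r"\S+", text))
    return {
        "sections_present": has_all,
        "sections_in_order": in_order,
        "word_count": n_words,
        "within_target": min_words <= n_words <= max_words,
    }


if __name__ == "__main__":
    sample = "STRENGTHS: Clear claim. AREAS FOR IMPROVEMENT: Fix typos. NEXT STEPS: Revise."
    print(check_feedback(sample))
```

A check like this flags outputs that are too short or missing a section, but it says nothing about quality, which is what the judge phase measures.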

### Phase 2: Judge Evaluation (Multi-Judge Mode)
- **Judges**: GPT-5.1, Claude-4.5-sonnet, Gemini-3-pro-preview
- **Dimensions Evaluated**:
  1. **Specificity**: Does feedback quote/reference specific essay text?
  2. **Actionability**: Are improvement steps clear and concrete?
  3. **Accuracy**: Is feedback factually correct about the essay?
  4. **Tone**: Is feedback encouraging and age-appropriate?
  5. **Pedagogical Value**: Does feedback teach writing principles?
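
A judge's output for one feedback sample can be represented as a small record with one field per dimension. The sketch below assumes integer scores on the 1-5 scale described under Technical Details; the class name `JudgeScore` and its field names are illustrative and not taken from the experiment's code.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class JudgeScore:
    """One judge's scores for one feedback sample (hypothetical record type)."""

    specificity: int
    actionability: int
    accuracy: int
    tone: int
    pedagogical_value: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1-5 Likert scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be an integer from 1 to 5, got {value!r}")


if __name__ == "__main__":
    print(JudgeScore(specificity=5, actionability=4, accuracy=5, tone=5, pedagogical_value=4))
```

Validating at construction time means a malformed judge response fails loudly instead of silently skewing the averages.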

---

## Results by Dimension

### Dimension Scores

| Model | Specificity | Actionability | Accuracy | Tone | Pedagogical |
|-------|-------------|---------------|----------|------|-------------|
| gpt-5.1 | 4.96 | 4.70 | 4.85 | 4.96 | 4.09 |
| gemini-3-flash-preview | 4.91 | 4.78 | 4.77 | 4.64 | 4.33 |
| gpt-5.1-mini | 4.99 | 4.81 | 4.85 | 4.53 | 4.06 |
| claude-4.5-sonnet | 4.87 | 4.56 | 4.89 | 4.52 | 4.09 |
| grok-4.1 | 4.90 | 4.48 | 4.43 | 4.94 | 4.08 |
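
The overall scores in the Key Findings table are consistent with an unweighted mean of the five dimension scores. The report does not state the aggregation rule, so treat the sketch below as a check under that assumption; the dictionary simply transcribes the table above.

```python
# Dimension scores transcribed from the table above, in column order:
# specificity, actionability, accuracy, tone, pedagogical value.
DIMENSION_SCORES = {
    "gpt-5.1": [4.96, 4.70, 4.85, 4.96, 4.09],
    "gemini-3-flash-preview": [4.91, 4.78, 4.77, 4.64, 4.33],
    "gpt-5.1-mini": [4.99, 4.81, 4.85, 4.53, 4.06],
    "claude-4.5-sonnet": [4.87, 4.56, 4.89, 4.52, 4.09],
    "grok-4.1": [4.90, 4.48, 4.43, 4.94, 4.08],
}


def overall(scores: list[float]) -> float:
    """Unweighted mean of the dimension scores, rounded to two decimals (assumed rule)."""
    return round(sum(scores) / len(scores), 2)


# Rank models from highest to lowest overall score.
ranking = sorted(DIMENSION_SCORES, key=lambda m: sum(DIMENSION_SCORES[m]), reverse=True)

if __name__ == "__main__":
    for rank, model in enumerate(ranking, start=1):
        print(rank, model, overall(DIMENSION_SCORES[model]))
```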

### Dimension Leaders

- **Specificity**: gpt-5.1-mini (4.99)
- **Actionability**: gpt-5.1-mini (4.81)
- **Accuracy**: claude-4.5-sonnet (4.89)
- **Tone**: gpt-5.1 (4.96)
- **Pedagogical Value**: gemini-3-flash-preview (4.33)
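
The leaders listed above, together with the spread between the best and worst model in each dimension, can be derived directly from the dimension table. This is a small illustrative sketch; `leaders_and_spread` is a hypothetical helper and the data are copied from the table.

```python
DIMENSIONS = ["specificity", "actionability", "accuracy", "tone", "pedagogical_value"]

# Scores transcribed from the Dimension Scores table, in DIMENSIONS order.
SCORES = {
    "gpt-5.1": [4.96, 4.70, 4.85, 4.96, 4.09],
    "gemini-3-flash-preview": [4.91, 4.78, 4.77, 4.64, 4.33],
    "gpt-5.1-mini": [4.99, 4.81, 4.85, 4.53, 4.06],
    "claude-4.5-sonnet": [4.87, 4.56, 4.89, 4.52, 4.09],
    "grok-4.1": [4.90, 4.48, 4.43, 4.94, 4.08],
}


def leaders_and_spread(scores: dict) -> dict:
    """Map each dimension to (leading model, leading score, max-min spread)."""
    result = {}
    for i, dim in enumerate(DIMENSIONS):
        column = {model: values[i] for model, values in scores.items()}
        best = max(column, key=column.get)
        spread = round(max(column.values()) - min(column.values()), 2)
        result[dim] = (best, column[best], spread)
    return result


if __name__ == "__main__":
    for dim, (model, score, spread) in leaders_and_spread(SCORES).items():
        print(f"{dim}: {model} ({score}), spread {spread}")
```

The spread column is useful alongside the leaders: a dimension with a narrow spread separates the models less than one where the top and bottom scores are far apart.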

---

## Visualizations

### Overall Comparison
![Overall Comparison](exp3_overall_comparison.png)

### Dimension Heatmap
![Dimension Heatmap](exp3_dimension_heatmap.png)

### Radar Chart
![Radar Chart](exp3_radar_chart.png)

### Dimension Rankings
![Dimension Rankings](exp3_dimension_rankings.png)

### Judge Comparison
![Judge Comparison](exp3_judge_comparison.png)

---

## Key Insights for Educators

### 1. All Models Excel at Specificity and Accuracy
All models scored **4.87 or higher on specificity** and **4.43 or higher on accuracy** (four of the five at 4.77 or above), meaning AI feedback consistently:
- Quotes and references specific student text
- Makes factually correct observations about essays

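### 2. Pedagogical Value Is the Weakest Dimension
**Pedagogical value** is the lowest-scoring dimension for every model (range: 4.06-4.33). The widest spreads between models are actually in accuracy (4.43-4.89) and tone (4.52-4.96), but pedagogical value is where all models have the most room to improve: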
- **Gemini-3-Flash** leads with 4.33, explaining *why* suggestions matter
- **GPT-5.1-mini** trails at 4.06, tending to identify problems without teaching

### 3. Tone Consistency Across Models
All models maintain an **encouraging, age-appropriate tone** (4.52-4.96), which suggests the feedback is generally suitable for students.

### 4. Actionability Varies
- **Gemini-3-Flash** and **GPT-5.1-mini** provide the most actionable feedback (4.78-4.81)
- **Grok-4.1** trails slightly (4.48), sometimes being more diagnostic than prescriptive

---

## Recommendations

### For Educators Using AI Feedback:

1. **GPT-5.1** has the highest overall score (4.71) and is a reasonable all-around choice, though its lead over the other models is small
2. **Gemini-3-Flash** excels at pedagogical depth - good for teaching-focused contexts
3. All models need human review for **pedagogical value** - consider adding "why this matters" context

### For AI Developers:

1. Focus on improving **pedagogical explanations** - the weakest dimension across all models
2. Consider adding explicit "teaching moments" to feedback prompts
3. Multi-judge evaluation provides robust quality assessment

---

## Technical Details

- **Feedback Generation**: Batch API with structured JSON output
- **Judge Assignment**: Round-robin to avoid self-evaluation bias
- **Scoring**: 1-5 Likert scale with detailed rubrics
- **Agreement**: Cross-judge correlation analyzed (see judge comparison chart)
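
The round-robin assignment mentioned above can be sketched as a rotation over the judge list that skips any judge from the same model family as the feedback author. This is an illustration under stated assumptions, not the experiment's code: it assigns a single judge per sample (the experiment used two or three for many samples), and it treats "self-evaluation" as "same vendor family", inferred from the model-name prefix. `family` and `assign_judges` are hypothetical names.

```python
from itertools import cycle

JUDGES = ["gpt-5.1", "claude-4.5-sonnet", "gemini-3-pro-preview"]


def family(model: str) -> str:
    """Assumed family key: the prefix before the first hyphen (gpt, claude, gemini, grok)."""
    return model.split("-")[0]


def assign_judges(feedback_items, judges=JUDGES):
    """Assign one judge per (essay_id, feedback_model) pair in round-robin order.

    A judge is skipped when it shares a family with the feedback model.
    """
    rotation = cycle(judges)
    assignments = []
    for essay_id, model in feedback_items:
        # Try each judge at most once per item before giving up.
        for _ in range(len(judges)):
            judge = next(rotation)
            if family(judge) != family(model):
                assignments.append((essay_id, model, judge))
                break
        else:
            raise ValueError(f"no eligible judge for {model}")
    return assignments


if __name__ == "__main__":
    items = [("essay-1", m) for m in ["gpt-5.1", "claude-4.5-sonnet", "grok-4.1"]]
    for row in assign_judges(items):
        print(row)
```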

---

*Report generated automatically from Experiment 3 results.*

---



## Deep Dive: Side-by-Side Model Comparisons

This section presents **actual feedback excerpts** from different models on the same student essays, allowing educators to see concrete differences in quality.

---

### Essay 1: Car-Free Cities (Grade 8 Argumentative)

**Student excerpt** (showing typical errors):
> "Who wouldnt want to have a car? Thats a reason we get from desination A to destination B. Vauban, Germany they are gicing up their cars, like why would you give a object that gets you places faster than walking. Thats not smart."

---

#### 🏆 GEMINI-3-FLASH (Pedagogical Value: 4.33)

```
STRENGTHS: You have gathered relevant evidence from various global contexts, such as the statistics regarding Vauban, Germany, and the smog levels in Paris. Your opening hook effectively engages the reader by questioning the practicality of car-free living, specifically when you ask, "what if you dont get to work on time?"

AREAS FOR IMPROVEMENT: First, your essay lacks a clear thesis statement. In the first paragraph, you ask questions but do not state a firm position. Try: "While cars offer convenience, cities should transition to car-free models to combat pollution." Second, there are several typos and grammatical errors. For example, "Vauban, Germany they are gicing up their cars" should be corrected to "In Vauban, Germany, residents are giving up their cars." Finally, ensure your paragraphs transition smoothly; the jump from Paris to President Obama feels abrupt and needs a connecting sentence.

NEXT STEPS: 
1. Proofread for spelling errors like "gicing" and "partical."
2. Add a clear thesis statement at the end of your introduction.
3. Use transition words like "Furthermore" or "In contrast" to link your body paragraphs together.
```

**What makes this effective:**
- ✅ **Teaches the concept**: Explains *why* a thesis matters ("you ask questions but do not state a firm position")
- ✅ **Model sentences**: Provides a complete thesis example the student can emulate
- ✅ **Before/after rewrites**: Shows transformation from error to correction
- ✅ **Specific next steps**: Numbered, actionable items

---

#### 💪 GPT-5.1 (Tone: 4.96, Highest)

```
**STRENGTHS**: You clearly share your opinion, especially in sentences like, “Thats not smart,” which shows a strong stance. You also use examples from different places, such as “Vauban, Germany” and “Paris enforced a partical driving ban,” which is great evidence to support your argument.

**AREAS FOR IMPROVEMENT**: First, your introduction needs a clearer claim. For example, instead of “Who wouldnt want to have a car?” you could write: “Car-free cities are a bad idea because people need cars to get to work and school on time.” That tells the reader your main argument. Second, some sentences are unclear or incomplete, like “Which mean the kids would have to walk everywhere they go.” This should be connected and corrected: “This means kids and parents would have to walk everywhere they go, which could make them late for work or school.” Finally, watch spelling and punctuation: “partical,” “airr,” “dont,” and missing apostrophes distract from your ideas.

**NEXT STEPS**: 1) Write a clear thesis sentence in the first paragraph. 2) Combine short, choppy sentences into longer, clearer ones. 3) Revise for spelling, capitalization, and correct use of apostrophes (don’t, that’s, wouldn’t).
```

**What makes this effective:**
- ✅ **Warm opening**: "You clearly share your opinion" — validates the student's voice
- ✅ **Positive framing**: "which is great evidence" — encourages continued effort
- ✅ **Student-centered language**: Uses "you" and speaks directly to the learner
- ✅ **Balanced structure**: Equal weight to strengths and improvements

---

#### ⚠️ GROK-4.1 (Actionability: 4.48 — Lowest)

```
**STRENGTHS**  
You effectively use real-world examples to support your points, such as Vauban, Germany, where "70 percent of Vauban's families do not have cars," and Paris's ban that cut "congestion...down 60 percent." These facts add credibility. You also pose engaging questions like "what if you dont get to work on time?" to challenge car-free ideas thoughtfully.

**AREAS FOR IMPROVEMENT**  
1. Spelling and typos weaken readability. Quote: "desination A to destination B" and "gicing up their cars." Fix by proofreading; e.g., rewrite to "destination A to destination B" and "giving up their cars."
2. No clear thesis or structure—essay jumps between examples without a main argument. Start with: "Car-free cities like Vauban are impractical because they limit mobility."
3. Sentence fragments and run-ons confuse readers. Quote: "Vauban, Germany they are gicing up their cars, like why..." Rewrite: "In Vauban, Germany, residents give up their cars. Why abandon something that gets you places faster than walking?"

**NEXT STEPS**  
1. Craft a thesis and outline: intro, 2-3 body paragraphs, conclusion.  
2. Use spellcheck and read aloud to fix errors.  
3. Expand arguments with more evidence and revise to 500+ words.  
Great start—keep questioning ideas! (178 words)
```

**Where this falls short:**
- ⚠️ **Vague advice**: "Craft a thesis and outline" — but *how* should a student do this?
- ⚠️ **Generic steps**: "Use spellcheck" applies to any essay
- ❌ **Truncated**: Feedback cut off at "with more eviden..." — incomplete instruction
- ✅ **Strengths noted**: Good diagnostic identification of issues

---

### Essay 2: Electoral College (Grade 10 Persuasive)

**Student excerpt**:
> "The Electoral College is a system that has been used for many years. I believe this method is unfair... layed peope have a better chance at getting their vote counted..."

---

#### GEMINI-3-FLASH on Electoral College Essay

```
STRENGTHS: You effectively integrated evidence from multiple sources, such as Source #2 and Source #3, to support your claims. Your use of the Romney vs. Obama example shows a good attempt to apply source material to real-world scenarios.

AREAS FOR IMPROVEMENT: Your essay contains several run-on sentences that make your arguments difficult to follow. For example, your first paragraph is one long sentence. Try breaking ideas apart: 'I believe this method is unfair. Therefore, I argue against it.' Additionally, there are frequent spelling errors like 'layed' and 'peope.' Proofreading will help maintain your credibility. Finally, your analysis of Source #3 in the second paragraph is slightly confusing; ensure you clearly explain how 'transregional appeal' supports your specific stance.

NEXT STEPS: 1. Break any sentence longer than three lines into two separate sentences to improve clarity. 2. Use a spell-checker to fix errors like 'prsident' and 'outdate.' 3. Rewrite your conclusion to include a specific call to action for the State Senator, rather than just restating that the system is 'wrong.'
```

---

#### GPT-5.1-MINI on Electoral College Essay

```
STRENGTHS: You open with a clear claim—“I disagree, I believe that this method is wrong and unfair for the peope”—and you use sources to support points, like quoting source #2 and #3. Your paragraph about swing states shows good focus: “candidates don't spend time in states they know they have no chance of winning.”

AREAS FOR IMPROVEMENT:
1. Quote accuracy and spelling: you wrote “th Electoral College” and “peope.” These typos reduce credibility; correct to “the Electoral College” and “people.”
2. Clarity and grammar: the sentence “which means that in the big states little to no votes are accepted by the Electoral College” is unclear. Rewrite example: “In large states, the winner-take-all system can ignore sizable minority votes.”
3. Evidence and analysis: you quote source #3 but don’t explain how it supports unfairness. Add a sentence connecting regional appeal to unequal voter representation (explain who benefits and who loses).

NEXT STEPS:
1. Fix typos and grammar throughout. 2. Replace unclear sentences with the example rewrite above. 3. Add one paragraph analyzing how each cited source strengthens your thesis.
```

---

### Essay 1 Revisited: Claude-4.5-Sonnet's Approach

#### CLAUDE-4.5-SONNET (Overall: 4.59)

```
**STRENGTHS**: You've engaged with interesting real-world examples like Vauban, Germany and Paris's driving ban, showing research effort. Your mention of "China is known as one of the most polluted cities" demonstrates awareness of global environmental issues. You're also asking thought-provoking questions like "what if you dont get to work on time?"

**AREAS FOR IMPROVEMENT**: First, your thesis needs clarity. "Thats not smart" is too informal and vague. Try instead: "While car-free cities aim to reduce pollution, they create significant transportation challenges for residents." Second, punctuation and spelling need attention. "Thats" should be "That's," "gicing" should be "giving," and "partical" should be "partial." Third, your argument jumps between ideas without clear transitions or organization. Connect your Paris example to Vauban with transitional phrases like "Similarly" or "In contrast."

**NEXT STEPS**: 1) Write a clear thesis statement that takes a definitive position on car-free cities. 2) Organize your body paragraphs with topic sentences that support your thesis. 3) Proofread carefully for spelling, punctuation, and capitalization errors before submitting.
```

**Claude's distinctive style:**
- ✅ **Sophisticated vocabulary**: Uses phrases like "thought-provoking questions" and "transitional phrases"
- ✅ **Academic tone**: More formal than GPT, suitable for older students
- ⚠️ **Slightly less warm**: Professional but less encouraging than GPT-5.1
- ✅ **Clear structure**: Well-organized strengths → improvements → next steps

---


### Key Patterns Observed

| Model | Signature Strength | Watch Out For |
|-------|-------------------|---------------|
| **Gemini-3-Flash** | Explains *why* — teaches concepts | Can be dense for younger students |
| **GPT-5.1** | Warm, encouraging tone | Sometimes less specific on mechanics |
| **GPT-5.1-Mini** | Highly actionable steps | Less pedagogical depth |
| **Claude-4.5-Sonnet** | Sophisticated, academic | Slightly formal for younger grades |
| **Grok-4.1** | Strong diagnosis | Vague on solutions, occasional truncation |

---

### Practical Recommendations for Educators

#### By Grade Level:

| Grade Level | Recommended Model | Why |
|-------------|-------------------|-----|
| **Grades 4-6** | GPT-5.1 | Warmest tone, most encouraging |
| **Grades 7-9** | Gemini-3-Flash | Balances teaching with accessibility |
| **Grades 10-12** | Claude-4.5-Sonnet | Academic tone prepares for college |

#### By Use Case:

| Goal | Best Choice |
|------|-------------|
| Build student confidence | GPT-5.1 (tone: 4.96) |
| Teach writing principles | Gemini-3-Flash (pedagogical: 4.33) |
| Provide revision checklist | GPT-5.1-Mini (actionability: 4.81) |
| Quick diagnostics | Grok-4.1 (fast issue identification) |
| First-pass screening | Any model (all score 4.5+ overall) |

---

### What Judges Said

Here's actual reasoning from the LLM judges evaluating feedback quality:

**On Gemini's Pedagogical Value (Score: 5)**:
> "The feedback explains *why* changes should be made (e.g., 'Fixing them will make your essay stronger' or 'This is vague—add a specific example'). It teaches the concept of using concrete evidence over general statements."

**On GPT-5.1's Tone (Score: 5)**:
> "The tone is constructive and supportive, starting with genuine praise before addressing areas for improvement. It motivates revision without discouraging the student."

**On Grok's Actionability (Score: 4)**:
> "The feedback provides some steps but could be more specific. 'Craft a thesis' is good advice but lacks a concrete example or template the student could follow."

---



## Inter-Judge Agreement Analysis

A critical question in LLM-as-judge methodology is: **Do different judges agree on feedback quality?** This section analyzes the consistency between our three judge models.

### Methodology

- **136 feedback samples** were evaluated by multiple judges
- Each sample received scores from 2-3 different judge models
- We calculated pairwise agreement across all judge combinations
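
For one pair of judges, the four agreement metrics reported below can be computed from two aligned lists of scores. The following is a minimal standard-library sketch; the function name `agreement` and the example scores are illustrative and not taken from the experiment.

```python
from math import sqrt


def agreement(a: list, b: list) -> dict:
    """Pairwise agreement between two judges' scores on the same samples."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Pearson r is undefined when either judge gives the same score everywhere.
    r = cov / sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else float("nan")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return {
        "pearson_r": r,
        "mae": sum(diffs) / n,
        "exact_match": sum(d == 0 for d in diffs) / n,
        "within_1": sum(d <= 1 for d in diffs) / n,
    }


if __name__ == "__main__":
    # Made-up scores for two judges on six samples.
    judge_a = [5, 5, 4, 5, 3, 4]
    judge_b = [5, 4, 4, 5, 5, 4]
    print(agreement(judge_a, judge_b))
```

Averaging these per-pair values over all judge pairs gives summary figures of the kind shown in the next table.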

---

### Overall Agreement Metrics

| Metric | Value | Interpretation |
|--------|-------|----------------|
| **Mean Pearson r** | 0.30 | Weak-to-moderate correlation |
| **Mean Absolute Error** | 0.31 points | Judges differ by ~0.3 points on average on the 5-point scale |
| **Exact Match** | 71.7% | Judges give identical scores ~72% of the time |
| **Within ±1 Point** | 98.2% | Judges almost always agree within 1 point |

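**Key Finding**: Judges give identical scores about 72% of the time and **rarely disagree by more than 1 point** (98.2% within ±1). Because mean scores sit between 4.06 and 4.99, most ratings are a 4 or a 5, so within-±1 agreement is easy to reach; the low correlation (r = 0.30) indicates that judges agree only weakly on *which* feedback is better. The judges are consistent about the overall quality level, but less reliable for ranking individual samples.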

---

### Agreement by Dimension

| Dimension | Pearson r | Exact Match | Within ±1 |
|-----------|-----------|-------------|-----------|
| **Specificity** | 0.19 | 87.2% | 100% |
| **Accuracy** | 0.52 | 72.7% | 90.9% |
| **Pedagogical Value** | 0.28 | 73.1% | 100% |
| **Tone** | 0.29 | 72.5% | 100% |
| **Actionability** | 0.21 | 52.9% | 100% |

**Observations**:

1. **Specificity** has highest exact agreement (87%) — judges easily recognize when feedback quotes student text
2. **Accuracy** has highest correlation (r=0.52) — judges agree on whether feedback is factually correct
3. **Actionability** has lowest exact match (53%) — most subjective dimension; judges differ on what constitutes "clear steps"
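4. **Four of five dimensions have 100% within ±1**; accuracy is the exception at 90.9%, so roughly 9% of accuracy comparisons differ by 2 or more points despite accuracy having the highest correlation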
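
High exact-match rates can coexist with low Pearson r when scores cluster at the top of the scale. The toy example below uses invented scores (not data from this experiment) in which two judges each give mostly 5s and place their occasional 4s on different samples.

```python
from math import sqrt


def pearson_r(a: list, b: list) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Invented scores: 20 samples, each judge gives two 4s, on different samples.
judge_a = [4, 4] + [5] * 18
judge_b = [5, 5, 4, 4] + [5] * 16

pairs = list(zip(judge_a, judge_b))
exact_match = sum(x == y for x, y in pairs) / len(pairs)
within_one = sum(abs(x - y) <= 1 for x, y in pairs) / len(pairs)
r = pearson_r(judge_a, judge_b)

if __name__ == "__main__":
    print(f"exact match: {exact_match:.0%}, within ±1: {within_one:.0%}, r = {r:.2f}")
```

Here the judges match on 80% of samples and never differ by more than one point, yet the correlation is slightly negative. Correlation measures whether judges agree on which samples are better, and with so little variance in the scores it is sensitive to a handful of disagreements.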

---

### Agreement by Judge Pair

| Judge Pair | Pearson r | Exact Match | Within ±1 |
|------------|-----------|-------------|-----------|
| GPT-5.1 vs Claude-4.5 | 0.31 | 76.7% | 98.6% |
| GPT-5.1 vs Gemini-3-Pro | 0.34 | 66.1% | 98.3% |
| Claude-4.5 vs Gemini-3-Pro | 0.24 | 72.2% | 97.6% |

**Observations**:

1. **GPT-5.1 and Claude** show highest exact agreement (77%)
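2. **Gemini** shows a mixed pattern: it has the highest correlation with GPT-5.1 (r = 0.34) but the lowest exact-match rate with it (66.1%), and the lowest correlation with Claude (r = 0.24)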
3. All pairs maintain **97%+ within ±1**, so large disagreements are rare; this alone does not rule out a small systematic offset between judges
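
A within-±1 rate does not show whether one judge scores consistently higher than another. A simple complementary check is the mean signed difference between two judges on the same samples. The sketch below uses invented scores to show a pair that is always within one point yet has a clear offset; the function names are illustrative.

```python
def mean_signed_difference(a: list, b: list) -> float:
    """Average of (a - b); positive values mean judge A scores higher on average."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


def within_one_rate(a: list, b: list) -> float:
    """Fraction of samples where the two judges differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)


# Invented scores: judge B is often exactly one point stricter than judge A.
judge_a = [5, 5, 5, 5, 4, 5]
judge_b = [4, 5, 4, 4, 4, 4]

if __name__ == "__main__":
    print("within ±1:", within_one_rate(judge_a, judge_b))
    print("mean signed difference:", round(mean_signed_difference(judge_a, judge_b), 2))
```

Reporting the signed difference per judge pair alongside the agreement table would make any leniency or strictness gap visible.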

---

### Visualization

![Judge Agreement](exp3_judge_agreement.png)

---

### Implications for Research Validity

#### Strengths of Multi-Judge Approach

1. **Reduces bias**: No single model's preferences dominate
2. **Increases reliability**: About 98% within ±1 agreement indicates judges apply broadly similar standards
3. **Captures nuance**: Minor variations (±1 point) reflect genuine subjectivity in feedback evaluation

#### Limitations

1. **Moderate exact agreement** (72%): Averaging across judges may smooth over meaningful differences
2. **Low actionability agreement** (53%): This dimension may need clearer rubric definition
3. **Agreement vs. validity**: High agreement doesn't prove judges are "correct"; they may share biases

#### Recommendations for Future Research

1. **Include human judges** for ground-truth comparison
2. **Use weighted averaging** based on judge reliability per dimension
3. **Refine actionability rubric** to improve inter-judge consistency
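
The weighted-averaging recommendation could take the following form: each judge gets a per-dimension weight, and a sample's score is the weighted mean over whichever judges scored it. This is a sketch only. The weights below are placeholders for illustration, not values estimated from this experiment, and `weighted_score` is a hypothetical helper.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of judge scores for one sample on one dimension.

    Weights are renormalized over the judges present in `scores`, so the
    function works whether two or three judges scored the sample.
    """
    total_weight = sum(weights[judge] for judge in scores)
    if total_weight <= 0:
        raise ValueError("weights for the scoring judges must sum to a positive value")
    return sum(score * weights[judge] for judge, score in scores.items()) / total_weight


# Placeholder reliability weights for one dimension (assumed, not measured).
ACTIONABILITY_WEIGHTS = {
    "gpt-5.1": 0.5,
    "claude-4.5-sonnet": 0.3,
    "gemini-3-pro-preview": 0.2,
}

if __name__ == "__main__":
    three_judges = {"gpt-5.1": 5, "claude-4.5-sonnet": 4, "gemini-3-pro-preview": 4}
    two_judges = {"gpt-5.1": 5, "claude-4.5-sonnet": 4}
    print(weighted_score(three_judges, ACTIONABILITY_WEIGHTS))
    print(weighted_score(two_judges, ACTIONABILITY_WEIGHTS))
```

In practice the weights would need to come from an external reference such as agreement with human raters; deriving them from inter-judge agreement alone would reward judges for sharing the same biases.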

---

