# Experiment 4: Consistency & Reliability Analysis

## Overview

This experiment evaluates LLM grading consistency across three dimensions:
1. **Identical Resubmission** - Same essay graded 5 times
2. **Semantic Equivalence** - Original vs paraphrased essays
3. **Temperature Sensitivity** - Score variance at temperatures 0.0, 0.5, and 1.0

## Results Summary

### Test 1: Identical Resubmission

| Model | Exact Match | Mean Variance | Within 1pt |
|-------|-------------|---------------|------------|
| gemini-3-pro-preview | 100.0% | 0.0000 | 100.0% |
| gemini-3-flash-preview | 100.0% | 0.0000 | 100.0% |
| claude-4.5-sonnet | 91.7% | 0.0150 | 100.0% |
| gpt-5.1 | 79.2% | 0.0350 | 100.0% |
| grok-4.1 | 62.5% | 0.0750 | 100.0% |
| gpt-5.1-mini | 50.0% | 0.0900 | 100.0% |

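**Key Finding:** Both Gemini models returned identical scores on every resubmission (100% exact match, zero variance). All six models stayed within 1 point across repeated gradings of the same essay.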
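The three columns in the table above can be derived from the repeated scores for each essay. The following is a minimal sketch, not the experiment's actual analysis code. The function name `identical_metrics`, the input layout (essay ID mapped to its list of repeated scores), the use of population variance, and the reading of "exact match" as "all runs agree" are assumptions made for illustration.

```python
from statistics import mean, pvariance


def identical_metrics(scores_by_essay):
    """Summarize repeated gradings of the same essays.

    scores_by_essay: hypothetical layout, {essay_id: [run1, run2, ...]}.
    """
    runs = list(scores_by_essay.values())
    # Exact match (assumed definition): every repeated run gave the same score.
    exact = mean(1.0 if len(set(r)) == 1 else 0.0 for r in runs)
    # Mean variance (assumed): population variance per essay, averaged over essays.
    mean_var = mean(pvariance(r) for r in runs)
    # Within 1pt: highest and lowest run for an essay differ by at most 1 point.
    within_1 = mean(1.0 if max(r) - min(r) <= 1 else 0.0 for r in runs)
    return {"exact_match": exact, "mean_variance": mean_var, "within_1pt": within_1}


# Made-up illustrative scores, not data from this experiment.
example = {"essay_a": [4, 4, 4, 4, 4], "essay_b": [3, 3, 4, 3, 3]}
print(identical_metrics(example))
```

If the analysis used sample variance instead, `pvariance` would be swapped for `statistics.variance`, which gives slightly larger values for five runs.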

### Test 2: Semantic Equivalence

| Model | Exact Match | Mean Diff | Max Diff |
|-------|-------------|-----------|----------|
| gemini-3-flash-preview | 93.8% | 0.062 | 1 |
| gemini-3-pro-preview | 89.6% | 0.104 | 1 |
| gpt-5.1-mini | 87.5% | 0.125 | 1 |
| grok-4.1 | 85.4% | 0.146 | 1 |
| claude-4.5-sonnet | 83.3% | 0.167 | 1 |
| gpt-5.1 | 75.0% | 0.250 | 1 |

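**Key Finding:** All models score the paraphrased essay within 1 point of the original. Because the maximum difference is 1, every mismatch is a one-point difference, so each model's mean difference equals its mismatch rate (for example, 75.0% exact match for gpt-5.1 corresponds to a mean difference of 0.250).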
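The semantic equivalence columns can be sketched the same way from paired scores. This is an illustrative sketch under stated assumptions: the function name `semantic_metrics` and the input layout (a list of `(original_score, paraphrase_score)` tuples) are hypothetical, and the difference is assumed to be the absolute score difference.

```python
from statistics import mean


def semantic_metrics(pairs):
    """Compare scores for original and paraphrased essays.

    pairs: hypothetical layout, [(original_score, paraphrase_score), ...].
    """
    # Absolute difference per pair (assumed; a signed difference would also be possible).
    diffs = [abs(orig - para) for orig, para in pairs]
    return {
        "exact_match": mean(1.0 if d == 0 else 0.0 for d in diffs),
        "mean_diff": mean(diffs),
        "max_diff": max(diffs),
    }


# Made-up illustrative pairs, not data from this experiment.
example_pairs = [(4, 4), (3, 4), (5, 5), (2, 2)]
print(semantic_metrics(example_pairs))
```

When `max_diff` is 1, `mean_diff` and `1 - exact_match` coincide, which is the pattern visible in the table.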

### Test 3: Temperature Sensitivity

| Model | Temp 0.0 Var | Temp 0.5 Var | Temp 1.0 Var |
|-------|--------------|--------------|--------------|
| claude-4.5-sonnet | 0.0106 | 0.0280 | 0.0407 |
| gemini-3-flash-preview | 0.0000 | 0.0294 | 0.0392 |
| gemini-3-pro-preview | 0.0000 | 0.0196 | 0.0785 |
| gpt-5.1 | 0.0514 | 0.0313 | 0.0815 |
| grok-4.1 | 0.0808 | 0.1304 | 0.1605 |

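**Key Finding:** For all five models in the table, variance at temperature 1.0 exceeds variance at temperature 0.0. The increase is not strictly monotonic: gpt-5.1 shows lower variance at 0.5 (0.0313) than at 0.0 (0.0514). Both Gemini models show zero variance at temperature 0.0.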
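One way to check the temperature trend model by model is to test whether each row of the table is non-decreasing. The sketch below uses the variances copied from the table above; the helper name `is_non_decreasing` is hypothetical and not part of the experiment's code.

```python
# Variances copied from the Test 3 table: (temp 0.0, temp 0.5, temp 1.0).
variance_by_model = {
    "claude-4.5-sonnet": (0.0106, 0.0280, 0.0407),
    "gemini-3-flash-preview": (0.0000, 0.0294, 0.0392),
    "gemini-3-pro-preview": (0.0000, 0.0196, 0.0785),
    "gpt-5.1": (0.0514, 0.0313, 0.0815),
    "grok-4.1": (0.0808, 0.1304, 0.1605),
}


def is_non_decreasing(values):
    """Return True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Models whose variance does not rise at every temperature step.
non_monotonic = [m for m, v in variance_by_model.items() if not is_non_decreasing(v)]
# Models whose variance at 1.0 is higher than at 0.0.
higher_at_max = [m for m, v in variance_by_model.items() if v[-1] > v[0]]

print(non_monotonic)   # ['gpt-5.1']
print(len(higher_at_max))  # 5
```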

## Overall Consistency Ranking

1. **gemini-3-flash-preview** - Composite Score: 91.4%
2. **claude-4.5-sonnet** - Composite Score: 84.9%
3. **gemini-3-pro-preview** - Composite Score: 83.4%
4. **gpt-5.1-mini** - Composite Score: 79.2%
5. **gpt-5.1** - Composite Score: 71.2%
6. **grok-4.1** - Composite Score: 55.9%

## Visualizations

1. `exp4_identical_comparison.png` - Identical test results
2. `exp4_semantic_comparison.png` - Semantic equivalence results
3. `exp4_temperature_sensitivity.png` - Temperature sensitivity analysis
4. `exp4_overall_ranking.png` - Overall consistency ranking
5. `exp4_heatmap.png` - Comprehensive metrics heatmap

## Conclusions

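1. **Most Consistent Model:** gemini-3-flash-preview has the highest composite consistency score (91.4%)
2. **Determinism:** Both Gemini models showed zero variance at temperature 0.0 and a 100% exact match rate on identical resubmissions; this is an observation on this test set, not a guarantee of determinism
3. **Semantic Robustness:** All models score paraphrased essays within 1 point of the original, with exact match rates from 75.0% to 93.8%
4. **Temperature Impact:** All tested models show higher variance at temperature 1.0 than at 0.0, although gpt-5.1's variance dips at 0.5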

## Data Summary

- **Identical Test Results:** 1440
- **Semantic Test Results:** 576
- **Temperature Test Results:** 711

## Data Files

- `data/results/consistency_identical_complete/` - Identical test raw results
- `data/results/consistency_semantic_complete/` - Semantic test raw results
- `data/results/consistency_temperature_complete/` - Temperature test raw results
- `data/analysis/experiment_4/` - All analysis outputs
