# LLM Essay Grading Benchmark - Research Data Package

**Paper**: _Evaluating Large Language Models for Essay Scoring and Feedback: An Empirical Comparison_  
**Date**: December 2025  
**Contact**: GradingPal Research Team

---

## Overview

This data package contains all experimental inputs, outputs, and analyses from our evaluation of six large language models for automated essay scoring. The study comprises more than 15,000 essay evaluations across five experiments.

## Models Evaluated

| Model             | Provider  | Context Window | Native Reasoning     |
| ----------------- | --------- | -------------- | -------------------- |
| GPT-5.1           | OpenAI    | 128K           | ✅ Supported         |
| GPT-5-mini        | OpenAI    | 128K           | ✅ Supported         |
| Claude 4.5 Sonnet | Anthropic | 200K           | ✅ Extended Thinking |
| Gemini 3 Pro      | Google    | 1M             | ✅ Supported         |
| Gemini 3 Flash    | Google    | 1M             | ✅ Supported         |
| Grok 4.1          | xAI       | 128K           | ❌ Not Supported     |

---

## Directory Structure

```
data_package/
├── README.md                       # This file
├── input/
│   ├── essays/
│   │   ├── persuade_2_sample.csv   # 500 essays (Exp 1, 4, 5)
│   │   └── ellipse_sample.csv      # 500 essays (Exp 2)
│   └── prompts/
│       ├── zero_shot_holistic.txt
│       ├── zero_shot_rubric.txt
│       ├── few_shot_rubric.txt
│       ├── chain_of_thought.txt
│       ├── holistic_rubric.txt
│       └── six_trait_rubric.txt
├── experiment_1_holistic/          # Holistic Scoring Accuracy
├── experiment_2_multi_trait/       # Multi-Trait Scoring
├── experiment_3_feedback/          # Feedback Quality
├── experiment_4_consistency/       # Consistency & Reliability
├── experiment_5_prompting/         # Prompt Engineering
│   ├── results/                    # Strategy results (zero-shot, few-shot, CoT, reasoning)
│   ├── reasoning_traces/           # Native reasoning outputs by model
│   │   ├── gemini-3-pro/           # Thinking summaries, token counts
│   │   ├── gpt-5.1/                # Reasoning tokens
│   │   └── claude-4.5-sonnet/      # Extended thinking content, signatures
│   └── analysis/                   # Strategy comparison metrics
└── figures_publication/            # High-resolution figures
```

---

## Input Data

### Essays

#### `persuade_2_sample.csv`

- **Source**: PERSUADE 2.0 Corpus (Crossley et al., 2022)
- **N**: 500 essays
- **Columns**:
  - `essay_id`: Unique identifier
  - `full_text`: Complete essay text
  - `holistic_score`: Human-assigned score (1-6)
  - `grade_level`: Student grade (6-12)
  - `prompt_name`: Writing prompt identifier

#### `ellipse_sample.csv`

- **Source**: ELLIPSE Corpus (Crossley et al., 2022)
- **N**: 500 essays
- **Columns**:
  - `essay_id`: Unique identifier
  - `full_text`: Complete essay text
  - `cohesion`, `syntax`, `vocabulary`, `phraseology`, `grammar`, `conventions`: Trait scores (1.0-5.0)
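
The column lists above can be checked at load time. The following is a minimal sketch using only the Python standard library; the function name `load_essays` and the in-memory sample row are illustrative assumptions, not part of the package.

```python
import csv
import io

# Column names as documented above for each essay file.
PERSUADE_COLUMNS = {"essay_id", "full_text", "holistic_score", "grade_level", "prompt_name"}
ELLIPSE_TRAITS = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]
ELLIPSE_COLUMNS = {"essay_id", "full_text", *ELLIPSE_TRAITS}


def load_essays(handle, required_columns):
    """Read essay rows from an open CSV handle and verify the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = required_columns - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# Illustrative in-memory row (made-up data). Real use would pass
# open("input/essays/ellipse_sample.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "essay_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions\n"
    "E1,Some text,3.5,4.0,3.0,2.5,4.0,4.5\n"
)
rows = load_essays(sample, ELLIPSE_COLUMNS)

# csv returns strings, so trait scores need an explicit float conversion.
trait_scores = {trait: float(rows[0][trait]) for trait in ELLIPSE_TRAITS}
```

Passing `PERSUADE_COLUMNS` instead performs the same check for `persuade_2_sample.csv`.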

### Prompts

This folder contains all prompt templates used in the experiments, exported as plain-text files with placeholders for essay content and rubrics.
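
Filling a template might look like the sketch below. The placeholder tokens `{essay}` and `{rubric}` are assumptions made for illustration; check the template files for the actual placeholder syntax before reusing this.

```python
def fill_prompt(template: str, essay: str, rubric: str) -> str:
    """Substitute a rubric and an essay into a prompt template.

    Assumes hypothetical placeholders {rubric} and {essay}. str.replace is used
    rather than str.format so that braces inside an essay are left untouched.
    The essay is inserted last, so its text is never rescanned for placeholders.
    """
    return template.replace("{rubric}", rubric).replace("{essay}", essay)


# Made-up template text, for illustration only.
template = "Rubric:\n{rubric}\n\nEssay:\n{essay}"
prompt = fill_prompt(template, essay="I like {cats}.", rubric="Score from 1 to 6.")
```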

---

## Experiment Data

### Experiment 1: Holistic Scoring Accuracy

**Objective**: Measure LLM-human agreement on holistic essay scores.

| Folder      | Contents                                                |
| ----------- | ------------------------------------------------------- |
| `results/`  | Per-model JSON files with scores, feedback, reasoning   |
| `analysis/` | `exp1_holistic_summary.csv`, metrics JSON               |
| `figures/`  | QWK comparison, confusion matrices, score distributions |

**Result File Format** (`{model}_results.json`):

```json
{
  "essay_id": "ABC123",
  "holistic_score": 4,
  "feedback": "Your essay demonstrates...",
  "reasoning": "I assigned this score because...",
  "input_tokens": 1250,
  "output_tokens": 450
}
```
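
Records in this format can be paired with the human scores from `persuade_2_sample.csv` by `essay_id`. The sketch below assumes each results file holds a JSON list of such records, which the example above does not confirm; the helper name `pair_scores` and the human score shown are hypothetical.

```python
import json


def pair_scores(model_records, human_scores):
    """Return (human, model) score pairs for essays present in both sources."""
    pairs = []
    for record in model_records:
        essay_id = record["essay_id"]
        if essay_id in human_scores:
            pairs.append((human_scores[essay_id], int(record["holistic_score"])))
    return pairs


# Assumed file layout: a JSON list of records shaped like the example above.
raw = """[
  {"essay_id": "ABC123", "holistic_score": 4, "feedback": "...",
   "reasoning": "...", "input_tokens": 1250, "output_tokens": 450}
]"""
records = json.loads(raw)

human = {"ABC123": 5}  # hypothetical human score keyed by essay_id
pairs = pair_scores(records, human)

# Token counts can be summed to estimate usage per model.
total_tokens = sum(r["input_tokens"] + r["output_tokens"] for r in records)
```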

### Experiment 2: Multi-Trait Scoring

**Objective**: Evaluate trait-level scoring accuracy across the six ELLIPSE traits.

| Folder      | Contents                                      |
| ----------- | --------------------------------------------- |
| `results/`  | Per-model JSON files with trait scores        |
| `analysis/` | `exp2_multi_trait_summary.csv`, per-trait QWK |
| `figures/`  | Trait heatmaps, difficulty rankings           |

**Result File Format**:

```json
{
  "essay_id": "XYZ789",
  "cohesion": 3.5,
  "syntax": 4.0,
  "vocabulary": 3.0,
  "phraseology": 2.5,
  "grammar": 4.0,
  "conventions": 4.5,
  "feedback": "..."
}
```
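
Before computing per-trait metrics, it can help to confirm that each record carries all six traits within the 1.0-5.0 range used by the ELLIPSE scores. The validator below is an illustrative sketch; `validate_trait_record` is a hypothetical helper, not a script shipped with this package.

```python
TRAITS = ("cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions")


def validate_trait_record(record):
    """Return a list of problems found in one result record (empty if none)."""
    problems = []
    for trait in TRAITS:
        value = record.get(trait)
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            problems.append(f"{trait}: missing or non-numeric")
        elif not 1.0 <= value <= 5.0:
            problems.append(f"{trait}: {value} outside 1.0-5.0")
    return problems


good = {"essay_id": "XYZ789", "cohesion": 3.5, "syntax": 4.0, "vocabulary": 3.0,
        "phraseology": 2.5, "grammar": 4.0, "conventions": 4.5, "feedback": "..."}
bad = {"essay_id": "XYZ790", "cohesion": 3.5, "syntax": 6.0, "vocabulary": 3.0,
       "phraseology": 2.5, "conventions": 4.5}

good_problems = validate_trait_record(good)
bad_problems = validate_trait_record(bad)
```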

### Experiment 3: Feedback Quality

**Objective**: Assess pedagogical value of LLM-generated feedback.

| Folder      | Contents                                |
| ----------- | --------------------------------------- |
| `results/`  | Generated feedback, judge evaluations   |
| `analysis/` | Dimension scores, inter-judge agreement |
| `figures/`  | Radar charts, dimension heatmaps        |

**Evaluation Dimensions** (1-5 scale):

- Accuracy, Specificity, Actionability, Tone, Pedagogical Value

### Experiment 4: Consistency & Reliability

**Objective**: Test scoring reproducibility across conditions.

| Folder                 | Contents                                    |
| ---------------------- | ------------------------------------------- |
| `results/identical/`   | Same essay scored 5 times                   |
| `results/semantic/`    | Original vs. paraphrased essays             |
| `results/temperature/` | Scores at T=0.0, 0.5, 1.0                   |
| `analysis/`            | Variance, ICC, exact match metrics          |
| `figures/`             | Consistency comparisons, temperature curves |
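
For the identical-input condition, two of the simpler consistency measures can be sketched as follows. The definitions here are assumptions and may differ from those used in `analysis/`: the "identical" rate is the share of essays whose repeated runs all agree, and variance is the population variance within each essay, averaged over essays. ICC is not covered by this sketch.

```python
from statistics import pvariance


def consistency_summary(runs_by_essay):
    """Summarize repeated scoring runs.

    runs_by_essay maps essay_id -> list of scores from repeated runs on that essay.
    """
    n = len(runs_by_essay)
    identical = sum(1 for scores in runs_by_essay.values() if len(set(scores)) == 1)
    mean_variance = sum(pvariance(scores) for scores in runs_by_essay.values()) / n
    return {
        "all_runs_identical_rate": identical / n,
        "mean_within_essay_variance": mean_variance,
    }


# Made-up scores for two essays, five runs each.
runs = {"A": [4, 4, 4, 4, 4], "B": [3, 4, 3, 3, 3]}
summary = consistency_summary(runs)
```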

### Experiment 5: Prompt Engineering

**Objective**: Compare prompting strategies and native reasoning modes.

| Folder              | Contents                                              |
| ------------------- | ----------------------------------------------------- |
| `results/`          | Zero-shot, few-shot, chain-of-thought, reasoning mode |
| `reasoning_traces/` | Native reasoning outputs (thinking content, tokens)   |
| `analysis/`         | Strategy comparison metrics                           |
| `figures/`          | Strategy comparison charts, improvement visualization |

**Strategies Tested**:

1. Zero-Shot: Rubric only
2. Few-Shot: Rubric + 3 scored exemplars
3. Chain-of-Thought: Step-by-step reasoning instructions
4. Native Reasoning Mode: Provider-specific internal deliberation

**Native Reasoning Mode Configuration**:

| Provider  | Parameter                | Setting |
| --------- | ------------------------ | ------- |
| Google    | `thinkingLevel`          | HIGH    |
| OpenAI    | `reasoning_effort`       | high    |
| Anthropic | `thinking.budget_tokens` | 10000   |
| xAI       | Not supported            | N/A     |

**Key Finding**: Gemini 3 Pro with native reasoning achieved the highest scoring agreement (46.5% exact match, 96.5% within ±1 point).

---

## Key Metrics

| Metric      | Description                          | Range   |
| ----------- | ------------------------------------ | ------- |
| QWK         | Quadratic Weighted Kappa             | -1 to 1 |
| MAE         | Mean Absolute Error (score points)   | ≥ 0     |
| Exact Match | % of score pairs that match exactly  | 0-100%  |
| Adjacent    | % of score pairs within ±1 point     | 0-100%  |
| ICC         | Intraclass Correlation Coefficient   | 0-1     |
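
The agreement metrics in the table can be sketched in plain Python as follows. This illustrates the standard definitions under stated assumptions (integer scores on a fixed scale, 1-6 by default); it is not the analysis code used to produce the reported results, and the function names are hypothetical.

```python
def quadratic_weighted_kappa(human, model, min_score=1, max_score=6):
    """Quadratic Weighted Kappa for two lists of integer scores on a fixed scale."""
    n_cat = max_score - min_score + 1
    n = len(human)
    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]
    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_m[j] / n  # expected count under independence
    # Kappa is undefined when every score falls in one category; this sketch returns 1.0.
    return 1.0 - num / den if den else 1.0


def agreement_stats(human, model):
    """MAE, exact-match percentage, and adjacent (within ±1) percentage."""
    diffs = [abs(h - m) for h, m in zip(human, model)]
    n = len(diffs)
    return {
        "mae": sum(diffs) / n,
        "exact_match_pct": 100 * sum(d == 0 for d in diffs) / n,
        "adjacent_pct": 100 * sum(d <= 1 for d in diffs) / n,
    }


# Made-up scores for four essays.
human = [3, 4, 4, 5]
model = [3, 4, 5, 5]
qwk = quadratic_weighted_kappa(human, model)
stats = agreement_stats(human, model)
```

Note that the adjacent percentage, as written here, includes exact matches.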

---

## Citation

If you use this data, please cite:

```bibtex
@techreport{gradingpal2025llm,
  title={Evaluating Large Language Models for Essay Scoring and Feedback: An Empirical Comparison},
  author={GradingPal Research Team},
  year={2025},
  institution={GradingPal}
}
```

---

## License

This data package is released for research purposes. The underlying essay datasets (PERSUADE 2.0, ELLIPSE) are subject to their original licenses.

---

## Contact

For questions about this data package, contact: research@gradingpal.com
