Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring
Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, Enkelejda Kasneci · November 25, 2024
Summary
The study compared large language models (LLMs) with teacher evaluations in grading German student essays across 10 criteria. Five LLMs were assessed: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B. The closed-source GPT models outperformed the open-source models in internal consistency and alignment with human ratings, especially on language-related criteria. The o1 model showed the highest correlation with human assessments (Spearman's r = .74) and the highest internal consistency (ICC = .80). These findings suggest that LLMs can help reduce teacher workload, particularly on language-focused criteria, but need further refinement to assess content quality reliably.
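As a point of reference for the two statistics cited above, the following is a minimal Python sketch of how model-human agreement (Spearman's r) and internal consistency across repeated model runs (ICC) can be computed. All scores are illustrative placeholders, not data from the paper, and the paper does not specify which ICC variant was used; the sketch relies on scipy and pingouin.

```python
# Minimal sketch: Spearman's r between model and human scores, and an
# intraclass correlation (ICC) across repeated model runs.
# All scores below are illustrative, not from the paper.
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-essay scores on one criterion (1-5 scale).
human_scores = [3, 4, 2, 5, 4, 3, 1, 4]
model_scores = [3, 4, 3, 5, 3, 3, 2, 4]

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman's r = {rho:.2f} (p = {p_value:.3f})")

# Internal consistency: score the same essays several times and
# measure agreement across runs with an ICC (via pingouin; the
# paper does not state which ICC variant was used).
import pingouin as pg

runs = pd.DataFrame({
    "essay": list(range(8)) * 3,
    "run":   [r for r in ("run1", "run2", "run3") for _ in range(8)],
    "score": [3, 4, 3, 5, 3, 3, 2, 4,
              3, 4, 2, 5, 4, 3, 2, 4,
              3, 5, 3, 5, 3, 3, 1, 4],
})
icc = pg.intraclass_corr(data=runs, targets="essay",
                         raters="run", ratings="score")
print(icc[["Type", "ICC"]])
```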
Introduction
Background
Overview of large language models (LLMs)
Importance of teacher evaluations in grading student essays
Objective
To compare the performance of large language models in grading German student essays against teacher evaluations
Method
Data Collection
Selection of essays for evaluation
Criteria for grading essays
Data Preprocessing
Preparation of essay data for model evaluation
Model Assessment
Evaluation of five large language models (GPT-3.5, GPT-4, o1, LLaMA 3-70B, Mixtral 8x7B) as essay raters; see the prompting sketch after this section
Comparison of models based on internal consistency and alignment with human ratings
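To make the evaluation setup concrete, here is a hedged sketch of how an LLM can be prompted to rate an essay on a set of criteria. The prompt wording, criteria list, rating scale, and the score_essay helper are illustrative assumptions, not the paper's actual rubric or code; the sketch uses the OpenAI chat completions API, which applies to the GPT models studied.

```python
# Hedged sketch of criterion-based essay scoring with an LLM.
# Criteria, scale, and prompt wording are placeholders, not the
# paper's actual 10-criterion rubric.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder subset of criteria for illustration only.
CRITERIA = ["spelling", "grammar", "vocabulary", "structure",
            "argument quality"]

def score_essay(essay_text: str, model: str = "gpt-4") -> dict:
    """Ask the model to rate one essay on each criterion (1-5)."""
    prompt = (
        "Rate the following German student essay on each criterion "
        "from 1 (poor) to 5 (excellent). Respond only with a JSON "
        f"object mapping criterion to score. "
        f"Criteria: {', '.join(CRITERIA)}.\n\n"
        f"Essay:\n{essay_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation
    )
    # A production pipeline would validate this parse; kept minimal here.
    return json.loads(response.choices[0].message.content)
```

Scoring each essay several times with such a helper is one way to obtain the repeated ratings needed for the internal-consistency analysis sketched earlier.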
Results
Model Performance
Comparison of closed-source vs. open-source models
Correlation of the o1 model with human assessments and its internal consistency
Findings
Potential of LLMs in reducing teacher workload
Limitations in assessing content quality
Discussion
Implications for Education
Role of LLMs in automated grading systems
Integration of LLMs in educational settings
Future Research
Enhancements for LLMs in content quality assessment
Comparative studies with other models
Conclusion
Summary of Findings
Recap of the study's main results
Recommendations
Guidelines for implementing LLMs in grading systems
Areas for further investigation
Basic info
Subject areas: Computation and Language, Human-Computer Interaction, Artificial Intelligence
Insights
Which large language models were compared in the study?
What were the main criteria used in the study to grade German student essays?
Which model showed the highest correlation with human assessments and internal consistency, and what were the respective Spearman's r and ICC values?
How did the closed-source GPT models perform compared to open-source models in terms of internal consistency and alignment with human ratings?