Grading Massive Open Online Courses Using Large Language Models
Shahriar Golchin, Nikhil Garuda, Christopher Impey, Matthew Wenger · June 16, 2024
Summary
This study investigates whether large language models (LLMs), specifically GPT-4 and GPT-3.5, can replace peer grading in massive open online courses (MOOCs), using zero-shot chain-of-thought prompting combined with instructor input. The authors compare LLM grading against instructor and peer assessments in three courses: astronomy, astrobiology, and the history and philosophy of astronomy. When guided by instructors, GPT-4 aligns with instructor grading more closely than peer grading does, particularly in courses with clear rubrics, and in the more advanced subjects no significant differences were found between LLM grades and the human assessments. However, the LLMs still struggle with responses that require critical thinking, and their consistency varies across long and short answers. The study highlights the need to refine LLM grading methods, incorporate human input, and conduct further research to ensure quality in complex disciplines. Overall, LLMs show promise for improving MOOC grading, but automation should be balanced with human oversight.
Introduction
Background
Emergence of large language models in education
Challenges in peer grading in MOOCs
Objective
Evaluate whether GPT-4 and GPT-3.5 can replace peer grading in MOOCs
Assess their performance with instructor guidance and clear rubrics
Method
Data Collection
Selection of courses: astronomy, astrobiology, and history/philosophy of astronomy
Sample of student submissions and instructor grades
Zero-shot chain-of-thought prompts with instructor input
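A minimal sketch of what a zero-shot chain-of-thought grading call with instructor input might look like, using the OpenAI Python SDK. The question, rubric text, score scale, and the grade_submission helper are illustrative placeholders, not the paper's actual prompt or code.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical prompt template: the instructor's input enters through the
    # rubric field, and "think step by step" is the zero-shot CoT trigger.
    GRADING_PROMPT = """You are grading a short-answer question in an astronomy MOOC.

    Question: {question}

    Instructor rubric (maximum 3 points):
    {rubric}

    Student answer:
    {answer}

    Let's think step by step through the rubric, then give the final score
    on the last line in the form "Score: X"."""

    def grade_submission(question: str, rubric: str, answer: str) -> str:
        """Grade one submission with a zero-shot chain-of-thought prompt."""
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # make grading as repeatable as possible
            messages=[{"role": "user", "content": GRADING_PROMPT.format(
                question=question, rubric=rubric, answer=answer)}],
        )
        return response.choices[0].message.content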
Data Preprocessing
Cleaning and standardization of student responses (see the sketch below)
Comparison dataset: instructor and peer grading scores
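The cleaning step can be as simple as the normalization below. This is a sketch under assumed cleaning rules; the paper's exact preprocessing pipeline is not reproduced here.

    import re

    def clean_response(text: str) -> str:
        """Normalize a raw student submission before grading (illustrative rules)."""
        text = text.strip()
        text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace and newlines
        text = text.replace("\u201c", '"').replace("\u201d", '"')  # straighten curly quotes
        text = text.replace("\u2019", "'")
        return text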
Evaluation
LLM Grading Performance
Alignment with instructor grading (an example metric is sketched below)
Effectiveness in different subjects
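One simple way to quantify alignment is to treat instructor scores as the reference and measure how closely each grader (the LLM or the peer average) tracks them. The metrics below are illustrative assumptions and may differ from the statistical tests the paper actually uses.

    import numpy as np
    from scipy import stats

    def alignment(grader_scores, instructor_scores):
        """Measure how closely a grader's scores track the instructor's."""
        g = np.asarray(grader_scores, dtype=float)
        i = np.asarray(instructor_scores, dtype=float)
        mae = float(np.mean(np.abs(g - i)))  # average point gap per submission
        r, p = stats.pearsonr(g, i)          # linear agreement and its p-value
        return {"mae": mae, "pearson_r": float(r), "p_value": float(p)}

    # e.g., compare alignment(gpt4_scores, instructor_scores)
    # against alignment(peer_scores, instructor_scores)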
Critical Thinking and Consistency
Analysis of long- and short-answer responses
Human Input and Refinement
Role of instructor guidance in performance
Complexity and Quality
Assessment in advanced subjects
Limitations in complex disciplines
Results
GPT-4's superiority over peer grading when given instructor guidance
Comparison: LLM vs. instructor and peer grading
Performance across disciplines
Discussion
Strengths and weaknesses of LLM grading
Need for refining LLM methods
Human oversight in maintaining quality
Conclusion
Promise of LLMs in enhancing MOOC grading
Recommendations for future research and implementation
Balancing automation and human involvement in the grading process
Categories: Computation and Language, Artificial Intelligence
Insights
In which subjects did the study find no significant differences between LLM grading and instructor/peer assessments?
What type of models does the study focus on for potential MOOC grading replacement?
How does GPT-4 perform compared to peer grading when guided by instructors?
What challenges does GPT-4 face in tasks requiring critical thinking and consistency?