Grading Massive Open Online Courses Using Large Language Models

Shahriar Golchin, Nikhil Garuda, Christopher Impey, Matthew Wenger·June 16, 2024

Summary

This study investigates the potential of large language models (LLMs), specifically GPT-4 and GPT-3.5, to replace peer grading in massive open online courses (MOOCs) by using zero-shot chain-of-thought prompts with instructor input. Results show that GPT-4, when guided by instructor-provided answers and rubrics, aligns more closely with instructor grading than peer grading does, especially in courses with clear rubrics. The study compares LLM grading with instructor and peer assessments in astronomy, astrobiology, and the history and philosophy of astronomy, finding no significant differences in the more advanced subjects. However, LLMs struggle with tasks requiring critical thinking and with consistency on very short or very long answers. The research highlights the need to refine LLM grading methods, incorporate human input, and conduct further work to ensure quality in complex disciplines. The study suggests that LLMs hold promise for enhancing MOOC grading but calls for a balance between automation and human oversight.


Paper digest

What problem does the paper attempt to solve? Is this a new problem?

The paper aims to investigate the potential of Large Language Models (LLMs) in grading assignments in Massive Open Online Courses (MOOCs) and assess the feasibility of replacing peer grading with LLMs to automate the grading process. This study addresses the challenge of delivering personalized and constructive feedback to a large number of online students, particularly in MOOCs, where peer grading is commonly used but its reliability and validity are often questioned. The research explores the use of LLMs, specifically GPT-4 and GPT-3.5, in grading assignments across different MOOC subjects, focusing on three distinct prompting strategies to evaluate the grading performance of these models. The paper introduces approaches to enhance the grading process in online education, emphasizing the potential benefits of using LLMs for grading assignments in MOOCs.


What scientific hypothesis does this paper seek to validate?

This paper aims to validate the potential of Large Language Models (LLMs) in grading assignments in Massive Open Online Courses (MOOCs) and assess the feasibility of replacing peer grading with LLMs to enhance the grading process in online learning. The study focuses on utilizing LLMs like GPT-4 and GPT-3.5 to evaluate student assignments in various MOOC subjects, such as Astrobiology, Introductory Astronomy, and the History and Philosophy of Astronomy, by employing different prompting strategies. The goal is to provide more personalized and automated grading feedback to improve the online learning experience for a global audience.


What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?

The paper proposes the utilization of Large Language Models (LLMs) like GPT-4 and GPT-3.5 to automate the grading process in Massive Open Online Courses (MOOCs) as a substitute for peer grading. The study examines the alignment between LLM-assigned grades and instructor-assigned grades, noting that further refinement is needed in advanced disciplines like Philosophy and Mathematics that require robust reasoning abilities. The research aims to enable more personalized and automated grading feedback, enhancing the online learning experience for millions of students globally by investigating the potential of LLMs to handle the task of grading assignments in MOOCs.

The paper introduces three distinct prompting strategies based on the zero-shot chain-of-thought (ZCoT) technique to assess the grading performance of LLMs in three MOOC subjects: Astrobiology, Introductory Astronomy, and the History and Philosophy of Astronomy. These prompting strategies include:

  1. ZCoT with instructor-provided correct answers.
  2. ZCoT with instructor-provided correct answers and rubrics.
  3. ZCoT with instructor-provided correct answers and LLM-generated rubrics based on those correct answers.

The study evaluates the effectiveness of each prompt using GPT-4 and GPT-3.5, analyzing how closely the resulting scores align with instructor evaluations. The findings suggest that GPT-4, when using ZCoT with instructor-provided answers and rubrics, outperforms peer grading in terms of score alignment with instructors. Additionally, the paper highlights that LLMs like GPT-3.5 and GPT-4 provide subject-appropriate explanations specific enough to justify point deductions, offering more consistent and comprehensive feedback than peer graders.
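To make these three prompt variants concrete, the sketch below assembles them as plain string templates in Python. This is a minimal illustration under stated assumptions: the exact wording, the placement of the "Let's think step by step" cue, the requested score format, and all function and parameter names are hypothetical and are not reproduced from the paper.

```python
# Hypothetical sketch of the three ZCoT prompt variants described above.
# Template wording, function names, and the score format are assumptions,
# not the paper's exact prompts.

ZCOT_CUE = "Let's think step by step."

def prompt_with_answer(question, correct_answer, student_answer, max_points):
    """Variant 1: ZCoT with the instructor-provided correct answer only."""
    return (
        f"You are grading a student's written answer out of {max_points} points.\n"
        f"Question: {question}\n"
        f"Instructor's correct answer: {correct_answer}\n"
        f"Student's answer: {student_answer}\n"
        f"{ZCOT_CUE} Then state the final grade as 'Score: X/{max_points}'."
    )

def prompt_with_answer_and_rubric(question, correct_answer, rubric,
                                  student_answer, max_points):
    """Variant 2: ZCoT with the correct answer and the instructor's rubric."""
    return (
        f"You are grading a student's written answer out of {max_points} points.\n"
        f"Question: {question}\n"
        f"Instructor's correct answer: {correct_answer}\n"
        f"Grading rubric:\n{rubric}\n"
        f"Student's answer: {student_answer}\n"
        f"{ZCOT_CUE} Then state the final grade as 'Score: X/{max_points}'."
    )

def prompt_generate_rubric(question, correct_answer, max_points):
    """First step of variant 3: ask the LLM to draft a rubric from the
    correct answer; the generated rubric is then reused in variant 2."""
    return (
        f"Question: {question}\n"
        f"Instructor's correct answer: {correct_answer}\n"
        f"Write a point-based grading rubric worth {max_points} points "
        f"for this question, based on the correct answer."
    )
```

In this sketch, variant 3 is a two-step use of the same machinery: the rubric produced by `prompt_generate_rubric` stands in for the instructor's rubric in `prompt_with_answer_and_rubric`.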


Does any related research exist? Who are the noteworthy researchers on this topic in this field? What is the key to the solution mentioned in the paper?

Several related research studies exist in the field, with noteworthy researchers contributing to this topic. Notable researchers include Shahriar Golchin and Mihai Surdeanu, who have worked on data contamination in large language models. Another significant researcher is Chris Impey, who has authored works on astronomy, astrobiology, and massive open online classes. Additionally, Ashish Vaswani, Noam Shazeer, and colleagues are cited for the Transformer architecture underlying modern LLMs, alongside prior work on neural architectures for short answer scoring.

The key to the solution mentioned in the paper revolves around the use of large language models (LLMs) to replace peer grading in massive open online courses (MOOCs). The study developed three distinct prompts based on the zero-shot chain-of-thought (ZCoT) prompting technique, each with varying in-context information. These methods include ZCoT with instructor-provided correct answers, ZCoT with instructor-provided correct answers and rubrics, as well as ZCoT with instructor-provided correct answers and LLM-generated rubrics. The effectiveness of each prompt was evaluated using LLMs like GPT-4 and GPT-3.5 to grade assignments from various MOOCs.


How were the experiments in the paper designed?

The experiments in the paper were designed to analyze the grading performance of two large language models (LLMs), GPT-4 and GPT-3.5, in three Massive Open Online Courses (MOOCs): Astrobiology, Introductory Astronomy, and the History and Philosophy of Astronomy. The study used a subset of 10 student writing assignments per course, with each course containing multiple assignments sourced from a proprietary repository to prevent data contamination. Instructor grading was performed by university professors who are experts in the relevant fields, and these grades were treated as the ground truth; grading was double-blind, so instructors were unaware of the peer grades and of students' identities. Peer grading was also incorporated: each writing assignment was evaluated by three to four randomly selected classmates using grading rubrics provided by the instructor, and the final peer grade was the median of the peer graders' scores.

The experiments employed three distinct prompting strategies based on the zero-shot chain-of-thought (ZCoT) technique: ZCoT with instructor-provided correct answers, ZCoT with both instructor-provided correct answers and rubrics, and ZCoT with instructor-provided correct answers and LLM-generated rubrics. The study aimed to assess the feasibility of using LLMs to replace peer grading in MOOCs, focusing on more personalized and automated grading feedback to improve the online learning experience for students globally.
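As a toy illustration of the peer-grade aggregation described above (three to four peer scores per assignment, with the final peer grade taken as their median), the snippet below uses invented scores; the assignment names and point values are not from the paper.

```python
from statistics import median

# Invented peer scores for two assignments; each assignment is scored by
# three to four randomly assigned classmates.
peer_scores = {
    "assignment_01": [14, 16, 15, 12],  # four peer graders
    "assignment_02": [18, 17, 19],      # three peer graders
}

# Final peer grade = median of the individual peer scores.
peer_grades = {name: median(scores) for name, scores in peer_scores.items()}
print(peer_grades)  # {'assignment_01': 14.5, 'assignment_02': 18}
```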


What is the dataset used for quantitative evaluation? Is the code open source?

The dataset used for quantitative evaluation is sourced from a proprietary repository of student writing assignments. The paper does not indicate that the code is open source.


Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.

The experiments and results presented in the paper provide substantial support for the scientific hypotheses that need to be verified. The study conducted a thorough analysis of grading methods using Large Language Models (LLMs) in Massive Open Online Courses (MOOCs) across subjects including Astrobiology, Introductory Astronomy, and the History and Philosophy of Astronomy. The research methodology involved comparing LLM-assigned grades with instructor-assigned grades as the ground truth, ensuring a robust evaluation process.

The experiments demonstrated that LLMs, particularly GPT-4, performed well in aligning with instructor-assigned grades, even outperforming peer grading in some instances. This indicates the effectiveness of LLMs in generating grades that closely match the standards set by instructors, showcasing their potential for automating grading processes in educational contexts. The study also highlighted the challenges faced by LLMs, such as a bias toward middle scores and difficulties with extremely short or long student responses.
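One way to surface the middle-score bias mentioned above is to compare the spread of LLM-assigned and instructor-assigned scores on the same assignments. The check below is a hedged illustration with invented scores, not an analysis taken from the paper.

```python
from statistics import mean, pstdev

# Invented scores (out of 20) for the same ten assignments.
instructor_scores = [20, 8, 15, 18, 6, 14, 19, 10, 16, 12]
llm_scores        = [16, 12, 14, 15, 11, 14, 16, 13, 15, 13]

# If the LLM compresses grades toward the middle, its mean will be similar
# to the instructor's but its standard deviation will be noticeably smaller.
for name, scores in [("instructor", instructor_scores), ("LLM", llm_scores)]:
    print(f"{name}: mean = {mean(scores):.1f}, std dev = {pstdev(scores):.2f}")
```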

Quantitatively, the analysis compared average scores produced by LLMs with peer grading as the baseline and instructor-assigned grades as the ground truth, aiming to identify the most effective prompt types and corresponding LLMs. The results indicated that LLM-assigned scores were generally higher and more generous than instructor-assigned grades, but they aligned more closely with the instructor's grades than peer grading did, especially in the History and Philosophy of Astronomy course. The study also used bootstrap resampling to evaluate potential bias in LLM grading, providing insights into the reliability and consistency of LLM grading.
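The bootstrap check referred to above could be set up roughly as follows: resample per-assignment score differences with replacement and inspect the resulting interval for the mean difference between LLM-assigned and instructor-assigned grades. This is a sketch under stated assumptions; the scores are invented and the exact resampling scheme used in the paper may differ.

```python
import random

# Invented paired scores (out of 20) for ten assignments.
instructor_scores = [15, 12, 18, 10, 16, 14, 17, 13, 11, 19]
llm_scores        = [16, 13, 18, 12, 17, 15, 17, 14, 13, 19]

def bootstrap_mean_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Bootstrap a 95% interval for mean(scores_a - scores_b), resampling
    the per-assignment differences with replacement."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

low, high = bootstrap_mean_diff(llm_scores, instructor_scores)
print(f"95% bootstrap interval for LLM minus instructor: [{low:.2f}, {high:.2f}]")
# An interval that sits entirely above zero would suggest the LLM grades
# more generously than the instructor on average.
```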

Overall, the experiments and results in the paper offer strong support for the scientific hypotheses by demonstrating the effectiveness of LLMs in grading student assignments in MOOCs and their alignment with instructor-assigned grades, despite encountering some challenges that need to be addressed for further improvement.


What are the contributions of this paper?

The paper on "Grading Massive Open Online Courses Using Large Language Models" makes several key contributions:

  • It analyzes student writing assignments in MOOCs using GPT-4 and GPT-3.5, comparing the grades generated by these models with instructor-assigned grades.
  • The study highlights that GPT-4 generally outperforms GPT-3.5 in generating grades that align closely with instructor-assigned grades, particularly in the context of the History and Philosophy of Astronomy course.
  • It introduces the concept of using Large Language Models (LLMs) to generate rubrics, suggesting that LLMs can produce improved rubrics compared to those provided by instructors due to their extensive interdisciplinary knowledge.
  • The research emphasizes the potential of LLM grading to automate the grading process in MOOCs, aiming to refine grading methodologies for better alignment with instructor evaluations, especially in advanced disciplines like Philosophy and Mathematics.
  • The paper provides insights into the use of peer grading, instructor grading, and bootstrap resampling techniques to evaluate student assignments, ensuring fair and accurate assessment in online courses.

What work can be continued in depth?

Further research can refine grading methodologies for improved congruence with instructor evaluations, especially in advanced disciplines like Philosophy and Mathematics that require robust reasoning abilities. Exploring the potential of large language models (LLMs) to handle assignment grading in Massive Open Online Courses (MOOCs), and to replace the current peer grading system, could also be continued in order to provide more personalized and automated grading feedback for online learners globally. This research could focus on developing more advanced prompting strategies and techniques to optimize the grading performance of LLMs, such as incorporating instructor-provided correct answers and rubrics, as well as exploring human-in-the-loop frameworks for automated short answer scoring that balance cost and quality; one possible shape of such a framework is sketched below.
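As a hedged illustration of a human-in-the-loop framework (not something the paper implements), a simple routing rule could accept an LLM grade automatically only when it agrees closely with the peer-grade median and escalate the remaining assignments to a human grader. The threshold, inputs, and function name below are assumptions.

```python
# Hypothetical routing rule for a human-in-the-loop grading workflow.
def needs_human_review(llm_score, peer_median, max_points, tolerance=0.15):
    """Flag an assignment for human grading when the LLM score and the
    peer-grade median disagree by more than `tolerance` of the maximum score."""
    return abs(llm_score - peer_median) > tolerance * max_points

print(needs_human_review(llm_score=18, peer_median=14, max_points=20))  # True  -> send to a human grader
print(needs_human_review(llm_score=15, peer_median=14, max_points=20))  # False -> accept the LLM grade
```

A rule like this trades grading cost against quality: the tighter the tolerance, the more assignments are routed to humans and the closer the accepted grades stay to a second, independent signal.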


Outline

  • Introduction
    • Background
      • Emergence of large language models in education
      • Challenges in peer grading in MOOCs
    • Objective
      • To evaluate GPT-4 and GPT-3.5 for replacing peer grading in MOOCs
      • Assess their performance with instructor guidance and clear rubrics
  • Method
    • Data Collection
      • Selection of courses: astronomy, astrobiology, and history/philosophy of astronomy
      • Sample of student submissions and instructor grades
      • Zero-shot chain-of-thought prompts with instructor input
    • Data Preprocessing
      • Cleaning and standardization of student responses
      • Comparison dataset: instructor and peer grading scores
  • Evaluation
    • LLM Grading Performance
      • Alignment with instructor grading
      • Effectiveness in different subjects
    • Critical Thinking and Consistency
      • Analysis of long and short answer responses
    • Human Input and Refinement
      • Role of instructor guidance in performance
    • Complexity and Quality
      • Assessment in advanced subjects
      • Limitations in complex disciplines
  • Results
    • GPT-4's superiority with instructor guidance
    • Comparison: LLM vs. instructor and peer grading
    • Performance across disciplines
  • Discussion
    • Strengths and weaknesses of LLM grading
    • Need for refining LLM methods
    • Human oversight in maintaining quality
  • Conclusion
    • Promise of LLMs in enhancing MOOC grading
    • Recommendations for future research and implementation
    • Balancing automation and human involvement in grading process
Basic info

Categories: computation and language; artificial intelligence