Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation
- URL: http://arxiv.org/abs/2404.15845v1
- Date: Wed, 24 Apr 2024 12:48:06 GMT
- Title: Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation
- Authors: Maja Stahl, Leon Biermann, Andreas Nehring, Henning Wachsmuth
- Abstract summary: Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text.
This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback.
Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback.
- Score: 13.854903594424876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback ultimately remains low.
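The joint setup described in the abstract can be pictured as a prompt that asks for a score and feedback in a chosen order. The following is a minimal sketch under stated assumptions, not the authors' prompts: the instruction wording, the `Score:` / `Feedback:` output format, the default 1 to 6 score range, and the names `build_prompt`, `parse_score`, and `ORDERS` are all hypothetical and do not come from the paper.

```python
import re

# Hypothetical instruction wording; the paper's actual prompts are not reproduced here.
SCORE_STEP = "Rate the essay from {low} to {high} on a line of the form 'Score: <number>'."
FEEDBACK_STEP = "Write helpful feedback for the student under the heading 'Feedback:'."

# Assumed task orderings: scoring alone, scoring before feedback, feedback before scoring.
ORDERS = {
    "score_only": [SCORE_STEP],
    "score_then_feedback": [SCORE_STEP, FEEDBACK_STEP],
    "feedback_then_score": [FEEDBACK_STEP, SCORE_STEP],
}


def build_prompt(essay, order="score_then_feedback", examples=None, score_range=(1, 6)):
    """Assemble a zero-shot prompt, or a few-shot one if scored examples are given."""
    if order not in ORDERS:
        raise ValueError(f"unknown order: {order!r}")
    low, high = score_range
    lines = ["You are assessing a student essay."]
    # Number the task steps in the requested order.
    for i, step in enumerate(ORDERS[order], start=1):
        lines.append(f"{i}. " + step.format(low=low, high=high))
    # Optional few-shot demonstrations: (essay text, score) pairs.
    for ex_essay, ex_score in examples or []:
        lines.append(f"\nExample essay:\n{ex_essay}\nScore: {ex_score}")
    lines.append(f"\nEssay:\n{essay}")
    return "\n".join(lines)


def parse_score(response):
    """Extract the first integer following 'Score:' from a model response, if any."""
    match = re.search(r"Score:\s*(\d+)", response)
    return int(match.group(1)) if match else None
```

The sketch only builds the prompt text and parses a reply; sending the prompt to a model is left out because the listing does not say which models or APIs were used.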
Related papers
- An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
- Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions [6.216542656489173]
We propose PROF, which PROduces Feedback by learning from LM-simulated student revisions.
We empirically test the efficacy of PROF and observe that our approach surpasses a variety of baseline methods in improving students' writing.
arXiv Detail & Related papers (2024-10-10T15:52:48Z)
- "My Grade is Wrong!": A Contestable AI Framework for Interactive Feedback in Evaluating Student Essays [6.810086342993699]
This paper introduces CAELF, a Contestable AI Empowered LLM Framework for automating interactive feedback.
CAELF allows students to query, challenge, and clarify their feedback by integrating a multi-agent system with computational argumentation.
A case study on 500 critical thinking essays with user studies demonstrates that CAELF significantly improves interactive feedback.
arXiv Detail & Related papers (2024-09-11T17:59:01Z)
- Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education [1.6340559025561785]
Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text.
We investigate whether LLMs can effectively assess human-written text for educational purposes.
arXiv Detail & Related papers (2024-07-24T06:02:57Z)
- Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course [49.296957552006226]
Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research.
This report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students.
arXiv Detail & Related papers (2024-07-07T00:17:24Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language models alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
- Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback [57.816210168909286]
We leverage recent progress on textual entailment models to address this problem for abstractive summarization systems.
We use reinforcement learning with reference-free, textual entailment rewards to optimize for factual consistency.
Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.
arXiv Detail & Related papers (2023-05-31T21:04:04Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Annotation and Classification of Evidence and Reasoning Revisions in Argumentative Writing [0.9449650062296824]
We introduce an annotation scheme to capture the nature of sentence-level revisions of evidence use and reasoning.
We show that reliable manual annotation can be achieved and that revision annotations correlate with a holistic assessment of essay improvement.
arXiv Detail & Related papers (2021-07-14T20:58:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.