Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation
- URL: http://arxiv.org/abs/2404.15845v1
- Date: Wed, 24 Apr 2024 12:48:06 GMT
- Title: Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation
- Authors: Maja Stahl, Leon Biermann, Andreas Nehring, Henning Wachsmuth
- Abstract summary: Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text.
This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback.
Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback.
- Score: 13.854903594424876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback ultimately remains low.
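To make the joint prompting idea concrete, below is a minimal Python sketch of a "score-then-feedback" prompt in the spirit of the Chain-of-Thought-inspired strategy the abstract describes. The prompt wording, the 1-6 rubric scale, and the call_llm() helper are illustrative assumptions, not the authors' exact setup.
```python
# Minimal sketch of joint essay scoring and feedback generation via prompting.
# The rubric format, score scale, and call_llm() helper are assumptions made
# for illustration; they are not taken from the paper.

def build_joint_prompt(essay: str, rubric: str) -> str:
    """Ask the model to score first, then condition its feedback on that
    score, in the spirit of Chain-of-Thought prompting."""
    return (
        "You are an experienced writing tutor.\n\n"
        f"Scoring rubric:\n{rubric}\n\n"
        f"Student essay:\n{essay}\n\n"
        "Step 1: Assign a holistic score from 1 to 6 and briefly justify it.\n"
        "Step 2: Based on your score and justification, give the student "
        "concrete, actionable feedback for improving the essay."
    )

def score_and_feedback(essay: str, rubric: str, call_llm) -> str:
    # call_llm is any function that sends a prompt string to an LLM and
    # returns its text completion (a hypothetical placeholder here).
    return call_llm(build_joint_prompt(essay, rubric))
```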
Related papers
- Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education [1.6340559025561785]
Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text.
We investigate whether LLMs can effectively assess human-written text for educational purposes.
arXiv Detail & Related papers (2024-07-24T06:02:57Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
- Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z)
- Evaluation of ChatGPT Feedback on ELL Writers' Coherence and Cohesion [0.7028778922533686]
ChatGPT has had a transformative effect on education, where students use it to help with homework assignments and teachers actively employ it in their teaching practices.
This study evaluated the quality of the feedback generated by ChatGPT on the coherence and cohesion of essays written by English language learner (ELL) students.
arXiv Detail & Related papers (2023-10-10T10:25:56Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance the alignment of large language models.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
- FABRIC: Automated Scoring and Feedback Generation for Essays [41.979996110725324]
We present FABRIC, a pipeline to help students and instructors in English writing classes by automatically generating 1) the overall scores, 2) specific rubric-based scores, and 3) detailed feedback on how to improve the essays.
We evaluate the effectiveness of the new DREsS and the augmentation strategy CASE quantitatively and show significant improvements over the models trained with existing datasets.
arXiv Detail & Related papers (2023-10-08T15:00:04Z)
- Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback [57.816210168909286]
We leverage recent progress on textual entailment models to address the problem of factual consistency in abstractive summarization systems.
We use reinforcement learning with reference-free, textual entailment rewards to optimize for factual consistency.
Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.
arXiv Detail & Related papers (2023-05-31T21:04:04Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- An Exploration of Post-Editing Effectiveness in Text Summarization [58.99765574294715]
"Post-editing" AI-generated text reduces human workload and improves the quality of AI output.
We compare post-editing of provided summaries with manual summarization in terms of summary quality, human efficiency, and user experience.
This study sheds light on when post-editing is useful for text summarization.
arXiv Detail & Related papers (2022-06-13T18:00:02Z)
- Annotation and Classification of Evidence and Reasoning Revisions in Argumentative Writing [0.9449650062296824]
We introduce an annotation scheme to capture the nature of sentence-level revisions of evidence use and reasoning.
We show that reliable manual annotation can be achieved and that revision annotations correlate with a holistic assessment of essay improvement.
arXiv Detail & Related papers (2021-07-14T20:58:26Z)