FABRIC: Automated Scoring and Feedback Generation for Essays
- URL: http://arxiv.org/abs/2310.05191v1
- Date: Sun, 8 Oct 2023 15:00:04 GMT
- Title: FABRIC: Automated Scoring and Feedback Generation for Essays
- Authors: Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim, Hyunseung Lim, Yoonsu
Kim, Tak Yeon Lee, Hwajung Hong, Juho Kim, So-Yeon Ahn, Alice Oh
- Abstract summary: We present FABRIC, a pipeline to help students and instructors in English writing classes by automatically generating 1) the overall scores, 2) specific rubric-based scores, and 3) detailed feedback on how to improve the essays.
We evaluate the effectiveness of the new DREsS and the augmentation strategy CASE quantitatively and show significant improvements over the models trained with existing datasets.
- Score: 41.979996110725324
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automated essay scoring (AES) provides a useful tool for students and
instructors in writing classes by generating essay scores in real-time.
However, previous AES models provide neither specific rubric-based scores nor
feedback on how to improve the essays, which can be even more important for
learning than the overall scores. We present FABRIC, a pipeline to help
students and instructors in English writing classes by automatically generating
1) the overall scores, 2) specific rubric-based scores, and 3) detailed
feedback on how to improve the essays. Under the guidance of English education
experts, we chose the rubrics for the specific scores as content, organization,
and language. The first component of the FABRIC pipeline is DREsS, a real-world
Dataset for Rubric-based Essay Scoring. The second component is CASE, a
Corruption-based Augmentation Strategy for Essays, with which we can improve
the accuracy of the baseline model by 45.44%. The third component is EssayCoT,
the Essay Chain-of-Thought prompting strategy which uses scores predicted from
the AES model to generate better feedback. We evaluate the effectiveness of the
new dataset DREsS and the augmentation strategy CASE quantitatively and show
significant improvements over the models trained with existing datasets. We
evaluate the feedback generated by EssayCoT with English education experts to
show significant improvements in the helpfulness of the feedback across all
rubrics. Lastly, we evaluate the FABRIC pipeline with students in a college
English writing class, who rated the generated scores and feedback an average
of 6 on a 7-point Likert scale.
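The abstract does not give the exact EssayCoT prompt, but the mechanism it describes, feeding the predicted rubric scores into the prompt so the LLM reasons from scores to suggestions, can be sketched. A minimal sketch in Python: the rubric names come from the paper, while the score_essay helper and the prompt wording are assumptions.

    RUBRICS = ("content", "organization", "language")

    def score_essay(essay: str) -> dict[str, float]:
        # Placeholder: a real pipeline would call the fine-tuned AES model here.
        return {rubric: 3.0 for rubric in RUBRICS}

    def build_essaycot_prompt(essay: str, scores: dict[str, float]) -> str:
        # The predicted rubric scores go into the prompt so the LLM reasons
        # from scores to rubric-specific suggestions (the EssayCoT idea).
        score_lines = "\n".join(f"- {r}: {scores[r]:.1f}" for r in RUBRICS)
        return (
            "You are an English writing tutor. The essay below received these "
            f"rubric-based scores:\n{score_lines}\n\n"
            "For each rubric, explain step by step why the essay earned its "
            "score, then give concrete advice on how to improve it.\n\n"
            f"Essay:\n{essay}"
        )

    essay = "..."  # a student essay
    prompt = build_essaycot_prompt(essay, score_essay(essay))
    # `prompt` can then be sent to any chat-style LLM to obtain the feedback.

Conditioning the feedback request on the predicted scores is what distinguishes EssayCoT from plain feedback prompting, per the abstract.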
Related papers
- Automatic Essay Multi-dimensional Scoring with Fine-tuning and Multiple Regression [27.152245569974678]
We develop two models that automatically score English essays across multiple dimensions.
Our systems achieve strong performance under three evaluation criteria: precision, F1 score, and Quadratic Weighted Kappa.
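Quadratic Weighted Kappa, used here and in the short-answer grading paper below, measures rater agreement while penalizing each disagreement by the squared distance between score levels. A minimal self-contained computation, assuming integer scores on a known range:

    import numpy as np

    def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
        """Agreement between two raters, weighting each disagreement by the
        squared distance between score levels (1.0 = perfect, ~0.0 = chance)."""
        n = max_score - min_score + 1
        # Observed matrix of (true, predicted) score-pair counts.
        observed = np.zeros((n, n))
        for t, p in zip(y_true, y_pred):
            observed[t - min_score, p - min_score] += 1
        # Expected counts if the two raters were independent.
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
        expected *= observed.sum() / expected.sum()
        # Quadratic disagreement weights.
        i, j = np.indices((n, n))
        weights = (i - j) ** 2 / (n - 1) ** 2
        return 1.0 - (weights * observed).sum() / (weights * expected).sum()

    # Perfect agreement yields 1.0.
    print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))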
arXiv Detail & Related papers (2024-06-03T10:59:50Z)
- Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation [13.854903594424876]
Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text.
This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback.
Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback.
arXiv Detail & Related papers (2024-04-24T12:48:06Z)
- DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing [16.76905904995145]
We release DREsS, a large-scale, standard dataset for rubric-based automated essay scoring.
DREsS comprises three sub-datasets: DREsS_New, DREsS_Std., and DREsS_CASE.
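Neither abstract spells out the corruption operations behind CASE and the DREsS_CASE sub-dataset, but the idea of corruption-based augmentation can be sketched: corrupt a well-scored essay and pair the result with a lowered label. The operations (sentence swap, sentence deletion) and the fixed score penalty below are illustrative assumptions, not the paper's exact recipe.

    import random

    def case_style_augment(essay: str, score: float, n_ops: int = 2,
                           penalty: float = 1.0, seed: int = 0):
        """Build a synthetic (essay, score) pair by corrupting sentences and
        lowering the label. Operations and penalty are illustrative."""
        rng = random.Random(seed)
        sentences = [s.strip() for s in essay.split(".") if s.strip()]
        for _ in range(n_ops):
            if len(sentences) < 2:
                break
            if rng.random() < 0.5:
                # Swap two sentences: damages organization.
                i, j = rng.sample(range(len(sentences)), 2)
                sentences[i], sentences[j] = sentences[j], sentences[i]
            else:
                # Drop a sentence: damages content.
                del sentences[rng.randrange(len(sentences))]
        return ". ".join(sentences) + ".", max(score - penalty, 0.0)

    # Each synthetic pair is added to the rubric-score training data.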
arXiv Detail & Related papers (2024-02-21T09:12:16Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Evaluation of ChatGPT Feedback on ELL Writers' Coherence and Cohesion [0.7028778922533686]
ChatGPT has had a transformative effect on education: students use it to help with homework assignments, and teachers actively employ it in their teaching practices.
This study evaluated the quality of ChatGPT-generated feedback on the coherence and cohesion of essays written by English Language Learner (ELL) students.
arXiv Detail & Related papers (2023-10-10T10:25:56Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language model alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
- Short Answer Grading Using One-shot Prompting and Text Similarity Scoring Model [2.14986347364539]
We developed an automated short answer grading model that provided both analytic scores and holistic scores.
The accuracy and quadratic weighted kappa of our model were 0.67 and 0.71, respectively, on a subset of the publicly available ASAG dataset.
arXiv Detail & Related papers (2023-05-29T22:05:29Z)
- EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: an instruction-based benchmark and evaluation suite for the automatic evaluation of editing capabilities.
We evaluate several pre-trained models and find that InstructGPT and PEER perform best, but most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
- Hierarchical Bi-Directional Self-Attention Networks for Paper Review Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: a sentence encoder (level one), an intra-review encoder (level two), and an inter-review encoder (level three).
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
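The summary names the three levels but not their internals. A compact Python sketch of the hierarchy, using PyTorch's nn.TransformerEncoder as a stand-in for the paper's bi-directional self-attention blocks; the layer sizes and mean pooling are assumptions:

    import torch
    import torch.nn as nn

    def encoder(dim: int) -> nn.TransformerEncoder:
        # Stand-in for one bi-directional self-attention level.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=1)

    class HierarchicalReviewEncoder(nn.Module):
        """Three levels: words -> sentence vectors -> review vectors ->
        paper vector, each level pooled by mean."""
        def __init__(self, dim: int = 64):
            super().__init__()
            self.sentence_enc = encoder(dim)   # level one: words in a sentence
            self.intra_review = encoder(dim)   # level two: sentences in a review
            self.inter_review = encoder(dim)   # level three: reviews of a paper
            self.rating_head = nn.Linear(dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (reviews, sentences, words, dim) word embeddings for one paper.
            r, s, w, d = x.shape
            sents = self.sentence_enc(x.reshape(r * s, w, d)).mean(dim=1)
            reviews = self.intra_review(sents.reshape(r, s, d)).mean(dim=1)
            paper = self.inter_review(reviews.unsqueeze(0)).mean(dim=1)
            return self.rating_head(paper)  # predicted review rating

    # Example: 3 reviews x 5 sentences x 12 words with 64-dim embeddings.
    print(HierarchicalReviewEncoder()(torch.randn(3, 5, 12, 64)).shape)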
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25% of the essay) with content unrelated to the topic of the question do not decrease the scores produced by the models.
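This overstability check is straightforward to reproduce. A minimal sketch, assuming a hypothetical aes_score interface over any trained model:

    def aes_score(essay: str) -> float:
        # Placeholder for any trained AES model's scoring interface.
        raise NotImplementedError

    def overstability_gap(essay: str, off_topic: str,
                          fraction: float = 0.25) -> float:
        """Append off-topic text amounting to `fraction` of the essay's
        length and report the resulting score drop. A robust model should
        show a clearly positive gap; the paper finds it is often near zero."""
        n_chars = int(len(essay) * fraction)
        modified = essay + " " + off_topic[:n_chars]
        return aes_score(essay) - aes_score(modified)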
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.