Related papers: Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers

Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers

URL: http://arxiv.org/abs/2012.11243v1
Date: Mon, 21 Dec 2020 10:47:30 GMT
Title: Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers
Authors: Yaman Kumar, Swati Aggarwal, Debanjan Mahata, Rajiv Ratn Shah, Ponnurangam Kumaraguru, Roger Zimmermann
Abstract summary: We present a fast, scalable, and accurate approach towards automated Short Answer Scoring (SAS) We propose and explain the design and development of a system for SAS, namely AutoSAS. AutoSAS shows state-of-the-art performance and achieves better results by over 8% in some of the question prompts.
Score: 63.835172924290326
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the era of MOOCs, online exams are taken by millions of candidates, where scoring short answers is an integral part. It becomes intractable to evaluate them by human graders. Thus, a generic automated system capable of grading these responses should be designed and deployed. In this paper, we present a fast, scalable, and accurate approach towards automated Short Answer Scoring (SAS). We propose and explain the design and development of a system for SAS, namely AutoSAS. Given a question along with its graded samples, AutoSAS can learn to grade that prompt successfully. This paper further lays down the features such as lexical diversity, Word2Vec, prompt, and content overlap that plays a pivotal role in building our proposed model. We also present a methodology for indicating the factors responsible for scoring an answer. The trained model is evaluated on an extensively used public dataset, namely Automated Student Assessment Prize Short Answer Scoring (ASAP-SAS). AutoSAS shows state-of-the-art performance and achieves better results by over 8% in some of the question prompts as measured by Quadratic Weighted Kappa (QWK), showing performance comparable to humans.

Related papers

SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models [36.10798324093408]
SAS-Bench is a benchmark for large language models (LLMs) based Short Answer Scoring tasks.<n>It provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types.<n>We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts.
arXiv Detail & Related papers (2025-05-12T05:43:21Z)
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations. This approach was originally developed for the TREC Question Answering (QA) Track in 2003. We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
Beyond Scores: A Modular RAG-Based System for Automatic Short Answer Scoring with Feedback [3.2734777984053887]
We propose a modular retrieval augmented generation based ASAS-F system that scores answers and generates feedback in strict zero-shot and few-shot learning scenarios. Results show an improvement in scoring accuracy by 9% on unseen questions compared to fine-tuning, offering a scalable and cost-effective solution.
arXiv Detail & Related papers (2024-09-30T07:48:55Z)
ASAG2024: A Combined Benchmark for Short Answer Grading [0.10826342457160269]
Short Answer Grading (SAG) systems aim to automatically score students' answers. There exists no comprehensive short-answer grading benchmark across different subjects, grading scales, and distributions. We introduce the combined ASAG2024 benchmark to facilitate the comparison of automated grading systems.
arXiv Detail & Related papers (2024-09-27T09:56:02Z)
Reducing the Cost: Cross-Prompt Pre-Finetuning for Short Answer Scoring [17.1154345762798]
We train a model on existing rubrics and answers with gold score signals and finetune it on a new prompt. Experiments show that finetuning on existing cross-prompt data with key phrases significantly improves scoring accuracy. It is crucial to design the model so that it can learn the task's general property.
arXiv Detail & Related papers (2024-08-26T00:23:56Z)
"I understand why I got this grade": Automatic Short Answer Grading with Feedback [33.63970664152288]
We introduce Engineering Short Answer Feedback (EngSAF), a dataset designed for automatic short-answer grading with feedback.<n>We incorporate feedback into our dataset by leveraging the generative capabilities of state-of-the-art large language models (LLMs) using our Label-Aware Synthetic Feedback Generation (LASFG) strategy.<n>The best-performing model (Mistral-7B) achieves an overall accuracy of 75.4% and 58.7% on unseen answers and unseen question test sets, respectively.
arXiv Detail & Related papers (2024-06-30T15:42:18Z)
Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
AutoSurvey: Large Language Models Can Automatically Write Surveys [77.0458309675818]
This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys. Traditional survey paper creation faces challenges due to the vast volume and complexity of information. Our contributions include a comprehensive solution to the survey problem, a reliable evaluation method, and experimental validation demonstrating AutoSurvey's effectiveness.
arXiv Detail & Related papers (2024-06-10T12:56:06Z)
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG) Evaluation is performed by scoring the RAG on an automatically-generated synthetic exam composed of multiple choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z)
Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms, sampling responses to be scored by humans intelligently. We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z)
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics. We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.