Automated Long Answer Grading with RiceChem Dataset
- URL: http://arxiv.org/abs/2404.14316v1
- Date: Mon, 22 Apr 2024 16:28:09 GMT
- Title: Automated Long Answer Grading with RiceChem Dataset
- Authors: Shashank Sonkar, Kangqi Ni, Lesa Tran Lu, Kristi Kincaid, John S. Hutchinson, Richard G. Baraniuk,
- Abstract summary: We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG)
ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers.
We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models.
- Score: 19.34390869143846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG). Distinguishing itself from Automated Short Answer Grading (ASAG) and Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To study ALAG, we introduce RiceChem, a dataset derived from a college chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models to verify whether each criterion, represented by a rubric item, is addressed in the student's response. This formulation enables the effective use of MNLI for transfer learning, significantly improving the performance of models on the RiceChem dataset. We demonstrate the importance of rubric-based formulation in ALAG, showcasing its superiority over traditional score-based approaches in capturing the nuances of student responses. We also investigate the performance of models in cold start scenarios, providing valuable insights into the practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-sourced Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task. With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. Code: \url{https://github.com/luffycodes/Automated-Long-Answer-Grading}.
Related papers
- Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation [72.70046559930555]
We propose a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA tasks.
Specifically, Adaptive-Note introduces an overarching view of knowledge growth, iteratively gathering new information in the form of notes.
In addition, we employ an adaptive, note-based stop-exploration strategy to decide "what to retrieve and when to stop" to encourage sufficient knowledge exploration.
arXiv Detail & Related papers (2024-10-11T14:03:29Z) - A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization [31.722907135361492]
Open-ended short-answer questions (SAGs) have been widely recognized as a powerful tool for providing deeper insights into learners' responses in the context of learning analytics (LA)
SAGs often present challenges in practice due to the high grading workload and concerns about inconsistent assessments.
We propose a unified multi-agent ASAG framework, GradeOpt, which leverages large language models (LLMs) as graders for SAGs.
arXiv Detail & Related papers (2024-10-03T03:11:24Z) - ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z) - "I understand why I got this grade": Automatic Short Answer Grading with Feedback [36.74896284581596]
We present a dataset of 5.8k student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task.
The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains.
arXiv Detail & Related papers (2024-06-30T15:42:18Z) - Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs)
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG)
Evaluation is performed by scoring the RAG on an automatically-generated synthetic exam composed of multiple choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z) - Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA)
We propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z) - SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question
Answering over a Life Science Knowledge Graph [0.0]
We evaluate strategies for fine-tuning the OpenLlama LLM for question answering over life science knowledge graphs.
We propose an end-to-end data augmentation approach for extending a set of existing queries over a given knowledge graph.
We also investigate the role of semantic "clues" in the queries, such as meaningful variable names and inline comments.
arXiv Detail & Related papers (2024-02-07T07:24:01Z) - Enhancing Textbook Question Answering Task with Large Language Models
and Retrieval Augmented Generation [3.948068081583197]
This paper proposes a methodology that handle the out-of-domain scenario in Textbook question answering (TQA)
Through supervised fine-tuning of the LLM model Llama-2 and the incorporation of RAG, our architecture outperforms the baseline, achieving a 4.12% accuracy improvement on validation set and 9.84% on test set for non-diagram multiple-choice questions.
arXiv Detail & Related papers (2024-02-05T11:58:56Z) - DAGA: Data Augmentation with a Generation Approach for Low-resource
Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.