RATAS framework: A comprehensive GenAI-based approach to rubric-based marking of real-world textual exams
- URL: http://arxiv.org/abs/2505.23818v1
- Date: Tue, 27 May 2025 22:17:27 GMT
- Title: RATAS framework: A comprehensive GenAI-based approach to rubric-based marking of real-world textual exams
- Authors: Masoud Safilian, Amin Beheshti, Stephen Elbourn
- Abstract summary: RATAS (Rubric Automated Tree-based Answer Scoring) is a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses. RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores.
- Score: 3.4132239125074206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated answer grading is a critical challenge in educational technology, with the potential to streamline assessment processes, ensure grading consistency, and provide timely feedback to students. However, existing approaches are often constrained to specific exam formats, lack interpretability in score assignment, and struggle with real-world applicability across diverse subjects and assessment types. To address these limitations, we introduce RATAS (Rubric Automated Tree-based Answer Scoring), a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses. RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores. We formalize the automatic grading task through a mathematical framework tailored to rubric-based assessment and present an architecture capable of handling complex, real-world exam structures. To rigorously evaluate our approach, we construct a unique, contextualized dataset derived from real-world project-based courses, encompassing diverse response formats and varying levels of complexity. Empirical results demonstrate that RATAS achieves high reliability and accuracy in automated grading while providing interpretable feedback that enhances transparency for both students and instructors.
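The abstract does not reproduce the RATAS architecture in code, so the following is only a minimal sketch of the general idea of rubric-based grading with a generative model: the rubric is serialized into the prompt and the model is asked to return a per-criterion score with a rationale. The `RUBRIC` contents, prompt wording, and `grade_with_rubric` helper are hypothetical illustrations, not taken from the paper.

```python
import json
from typing import Callable

# Hypothetical rubric: each criterion has an id, a description, and a maximum score.
RUBRIC = [
    {"id": "C1", "description": "Identifies the key requirements of the task", "max_points": 4},
    {"id": "C2", "description": "Justifies design decisions with evidence", "max_points": 4},
    {"id": "C3", "description": "Writes clearly and stays within scope", "max_points": 2},
]

def grade_with_rubric(question: str, answer: str, llm: Callable[[str], str]) -> dict:
    """Ask a generative model to score one answer against each rubric criterion
    and return a structured rationale per criterion (illustrative only)."""
    criteria = "\n".join(
        f"- {c['id']} (max {c['max_points']} pts): {c['description']}" for c in RUBRIC
    )
    prompt = (
        "You are grading a student answer against a rubric.\n"
        f"Question:\n{question}\n\nStudent answer:\n{answer}\n\nRubric:\n{criteria}\n\n"
        'Respond with JSON only: {"scores": [{"id": "...", "points": 0, "rationale": "..."}]}'
    )
    result = json.loads(llm(prompt))  # `llm` is any text-in/text-out model client
    result["total"] = sum(s["points"] for s in result["scores"])
    return result
```

Beyond this single prompt-and-parse step, the abstract indicates that RATAS formalizes grading mathematically and handles complex, tree-structured real-world exams, which this sketch does not attempt.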
Related papers
- Machine vs Machine: Using AI to Tackle Generative AI Threats in Assessment [0.0]
This paper presents a theoretical framework for addressing the challenges posed by generative artificial intelligence (AI) in higher education assessment. Large language models like GPT-4, Claude, and Llama increasingly demonstrate the ability to produce sophisticated academic content. Surveys indicate that 74-92% of students experiment with these tools for academic purposes.
arXiv Detail & Related papers (2025-05-31T22:29:43Z)
- Benchmarking and Rethinking Knowledge Editing for Large Language Models [34.80161437154527]
Knowledge editing aims to update embedded knowledge within Large Language Models (LLMs). Existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
arXiv Detail & Related papers (2025-05-24T13:32:03Z)
- StepGrade: Grading Programming Assignments with Context-Aware LLMs [0.6725011823614421]
This study introduces StepGrade, which explores the use of Chain-of-Thought (CoT) prompting with Large Language Models (LLMs). Unlike regular prompting, which offers limited and surface-level outputs, CoT prompting allows the model to reason step-by-step through the interconnected grading criteria. To empirically validate the efficiency of StepGrade, we conducted a case study involving 30 Python programming assignments across three difficulty levels.
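As a rough illustration of the contrast this summary draws between regular prompting and CoT prompting for grading, the sketch below builds both prompt styles for a programming assignment; the criteria names, weights, and wording are assumptions, not StepGrade's actual prompts.

```python
# Hypothetical criteria and weights; StepGrade's actual prompts are not reproduced here.
CRITERIA = {"functionality": 50, "code quality": 30, "error handling": 20}

def flat_prompt(code: str) -> str:
    # "Regular" prompting: one opaque request, typically yielding a surface-level judgment.
    return f"Grade this Python assignment out of 100:\n{code}"

def cot_prompt(code: str) -> str:
    # Chain-of-Thought prompting: walk through each criterion, reason, then score.
    steps = "\n".join(
        f"Step {i}: Evaluate {name} (worth {weight} points); explain your reasoning "
        "before assigning points."
        for i, (name, weight) in enumerate(CRITERIA.items(), start=1)
    )
    return (
        "Grade the following Python assignment by reasoning step by step through "
        f"each criterion, then sum the points.\n\n{code}\n\n{steps}\n"
        "Finally, report the total score out of 100 with a brief justification."
    )
```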
arXiv Detail & Related papers (2025-03-26T17:36:26Z)
- WritingBench: A Comprehensive Benchmark for Generative Writing [87.48445972563631]
We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations of style, format, and length.
arXiv Detail & Related papers (2025-03-07T08:56:20Z)
- Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z)
- StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
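The key point of this summary is that grading is deterministic and rule-based rather than judged by another model. A toy example of such a check, assuming a hypothetical compositional instruction like "return JSON with a summary of at most 20 words and exactly 3 keywords" (not StructTest's actual rules), might look like this:

```python
import json

def check_structured_output(raw: str) -> bool:
    """Deterministic, rule-based check for a hypothetical compositional instruction:
    'Return JSON with a "summary" of at most 20 words and exactly 3 "keywords".'
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    summary_ok = isinstance(obj.get("summary"), str) and len(obj["summary"].split()) <= 20
    keywords = obj.get("keywords")
    keywords_ok = isinstance(keywords, list) and len(keywords) == 3
    return summary_ok and keywords_ok

# The same output always passes or fails; no judge model is involved.
print(check_structured_output('{"summary": "A short recap.", "keywords": ["a", "b", "c"]}'))  # True
```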
arXiv Detail & Related papers (2024-12-23T22:08:40Z)
- AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment [12.970776782360366]
AERA Chat is an interactive platform that provides visually explained assessment of student answers.
Users can input questions and student answers to obtain automated, explainable assessment results from large language models.
arXiv Detail & Related papers (2024-10-12T11:57:53Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- "I understand why I got this grade": Automatic Short Answer Grading with Feedback [36.74896284581596]
We present a dataset of 5.8k student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task.
The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains.
arXiv Detail & Related papers (2024-06-30T15:42:18Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
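A simplified reading of this protocol: an evaluator answers each proxy-question using only the generated text, and the text's score is the evaluator's accuracy against the pre-annotated answers. The `evaluator` callable and exact-match comparison below are assumptions for illustration, not the paper's exact scoring rule.

```python
from typing import Callable, List, Tuple

def proxyqa_style_score(
    generated_text: str,
    proxy_qa: List[Tuple[str, str]],       # (proxy-question, pre-annotated answer) pairs
    evaluator: Callable[[str, str], str],  # answers a question given only the generated text
) -> float:
    """Score long-form text by how many proxy-questions an evaluator can answer
    correctly using only that text (simplified: naive exact-match comparison)."""
    if not proxy_qa:
        return 0.0
    correct = sum(
        evaluator(generated_text, question).strip().lower() == gold.strip().lower()
        for question, gold in proxy_qa
    )
    return correct / len(proxy_qa)
```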
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Evaluating the Generation Capabilities of Large Chinese Language Models [27.598864484231477]
This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework.
It assesses the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Gscore automates the quality measurement of a model's text generation against reference standards.
arXiv Detail & Related papers (2023-08-09T09:22:56Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automate the human assessment of an activity developed for testing computational thinking skills.
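The Bayesian-network construction itself is not spelled out in this summary; as a minimal numeric illustration of the underlying idea, the snippet below performs a single-node Bayesian update of the belief that a learner has a skill after observing one rubric item, with made-up probabilities.

```python
def posterior_skill(prior: float, p_pass_given_skill: float,
                    p_pass_given_no_skill: float, passed: bool) -> float:
    """Single-node Bayesian update of the belief that a learner has a skill,
    given whether one rubric item was satisfied (all probabilities illustrative)."""
    like_skill = p_pass_given_skill if passed else 1 - p_pass_given_skill
    like_no_skill = p_pass_given_no_skill if passed else 1 - p_pass_given_no_skill
    evidence = like_skill * prior + like_no_skill * (1 - prior)
    return like_skill * prior / evidence

# Example: prior belief 0.5; the item is easy with the skill (0.9), hard without (0.2).
print(posterior_skill(0.5, 0.9, 0.2, passed=True))  # ~0.818
```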
arXiv Detail & Related papers (2022-09-07T10:09:12Z)