AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
- URL: http://arxiv.org/abs/2410.09507v1
- Date: Sat, 12 Oct 2024 11:57:53 GMT
- Title: AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
- Authors: Jiazheng Li, Artem Bobrov, David West, Cesare Aloisi, Yulan He
- Abstract summary: AERA Chat is an interactive platform to provide visually explained assessment of student answers.
Users can input questions and student answers to obtain automated, explainable assessment results from large language models.
- Score: 12.970776782360366
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generating rationales that justify scoring decisions has emerged as a promising approach to enhance explainability in the development of automated scoring systems. However, the scarcity of publicly available rationale data and the high cost of annotation have resulted in existing methods typically relying on noisy rationales generated by large language models (LLMs). To address these challenges, we have developed AERA Chat, an interactive platform that provides visually explained assessment of student answers and streamlines the verification of rationales. Users can input questions and student answers to obtain automated, explainable assessment results from LLMs. The platform's innovative visualization features and robust evaluation tools make it useful for educators to assist with their marking process, for researchers to evaluate assessment performance and the quality of rationales generated by different LLMs, and as a tool for efficient annotation. We evaluated three rationale generation approaches on our platform to demonstrate its capability.
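As a rough illustration of the assessment flow described in the abstract (a question and a student answer go in; a score and a justifying rationale come back), here is a minimal sketch. It is not the AERA Chat implementation: the prompt wording, the JSON output fields, and the `call_llm` callable are hypothetical placeholders for whatever LLM client the platform actually uses.

```python
# Minimal sketch (not the AERA Chat implementation): prompting an LLM to
# return a score plus a justifying rationale for one student answer.
# `call_llm` is a hypothetical stand-in for any chat-completion client.
import json
from typing import Callable

ASSESS_PROMPT = """You are marking a student answer.
Question: {question}
Marking rubric: {rubric}
Student answer: {answer}

Return JSON with exactly two fields:
  "score": an integer mark,
  "rationale": a short explanation justifying the score."""

def assess_answer(question: str, rubric: str, answer: str,
                  call_llm: Callable[[str], str]) -> dict:
    """Ask the LLM for a score and a rationale, then parse its JSON reply."""
    reply = call_llm(ASSESS_PROMPT.format(question=question,
                                          rubric=rubric, answer=answer))
    return json.loads(reply)  # e.g. {"score": 2, "rationale": "..."}
```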
Related papers
- Human-Centered Design for AI-based Automatically Generated Assessment Reports: A Systematic Review [4.974197456441281]
This study emphasizes the importance of reducing teachers' cognitive demands through user-centered and intuitive designs.
It highlights the potential of diverse information presentation formats, such as text, visual aids, and plots, and of advanced functionalities, such as live and interactive features, to enhance usability.
The framework aims to address challenges in engaging teachers with technology-enhanced assessment results, facilitating data-driven decision-making, and providing personalized feedback to improve the teaching and learning process.
arXiv Detail & Related papers (2024-12-30T16:20:07Z) - An Automated Explainable Educational Assessment System Built on LLMs [12.970776782360366]
AERA Chat is an automated educational assessment system designed for interactive and visual evaluations of student responses.
Our system allows users to input questions and student answers, providing educators and researchers with insights into assessment accuracy.
arXiv Detail & Related papers (2024-12-17T23:29:18Z) - An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate that Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness of saliency-based explanations and their potential for misunderstanding.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Digital Socrates: Evaluating LLMs through Explanation Critiques [37.25959112212333]
Digital Socrates is an open-source, automatic critique model for model explanations.
We show how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains.
arXiv Detail & Related papers (2023-11-16T06:51:46Z) - Empowering Private Tutoring by Chaining Large Language Models [87.76985829144834]
This work explores the development of a full-fledged intelligent tutoring system powered by state-of-the-art large language models (LLMs).
The system is divided into three inter-connected core processes: interaction, reflection, and reaction.
Each process is implemented by chaining LLM-powered tools along with dynamically updated memory modules.
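The entry above describes chaining LLM-powered tools around dynamically updated memory. The sketch below illustrates that general pattern under stated assumptions; it is not the paper's system, and the prompt strings, memory keys, and `call_llm` callable are hypothetical.

```python
# Hypothetical sketch of chaining LLM-backed steps that share a dynamically
# updated memory dict; not the paper's implementation.
from typing import Callable, Dict

def interaction(memory: Dict[str, str], student_msg: str,
                call_llm: Callable[[str], str]) -> str:
    """Generate a tutor reply and append the exchange to the shared memory."""
    reply = call_llm(f"Dialogue so far: {memory.get('history', '')}\n"
                     f"Student: {student_msg}\nTutor:")
    memory["history"] = (memory.get("history", "")
                         + f"\nStudent: {student_msg}\nTutor: {reply}")
    return reply

def reflection(memory: Dict[str, str], call_llm: Callable[[str], str]) -> None:
    """Summarize the dialogue and write the summary back into memory."""
    memory["summary"] = call_llm(
        f"Summarize the student's progress so far:\n{memory.get('history', '')}")

def reaction(memory: Dict[str, str], call_llm: Callable[[str], str]) -> str:
    """Plan the next tutoring action from the reflected summary."""
    return call_llm(
        f"Given this progress summary, propose the next exercise:\n"
        f"{memory.get('summary', '')}")
```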
arXiv Detail & Related papers (2023-09-15T02:42:03Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automate the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
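For readers unfamiliar with off-policy evaluation, the sketch below shows the textbook trajectory-wise importance-sampling estimator, which likewise needs only pre-collected experience. It is an illustration of the general off-policy idea, not the ENIGMA method described in the entry above.

```python
# Generic trajectory-wise importance-sampling estimate of a target policy's
# expected return from logged (behavior-policy) dialogues. Shown only to
# illustrate off-policy evaluation; this is NOT the ENIGMA estimator.
from typing import List, Tuple

# One logged turn: (prob. of the action under the target policy,
# prob. under the behavior policy that collected the data, reward).
Turn = Tuple[float, float, float]

def importance_sampling_estimate(episodes: List[List[Turn]]) -> float:
    """Average importance-weighted return over pre-collected episodes."""
    total = 0.0
    for episode in episodes:
        weight, episode_return = 1.0, 0.0
        for p_target, p_behavior, reward in episode:
            weight *= p_target / p_behavior  # cumulative importance ratio
            episode_return += reward
        total += weight * episode_return
    return total / len(episodes)
```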
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.