AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
- URL: http://arxiv.org/abs/2410.09507v1
- Date: Sat, 12 Oct 2024 11:57:53 GMT
- Title: AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
- Authors: Jiazheng Li, Artem Bobrov, David West, Cesare Aloisi, Yulan He,
- Abstract summary: AERA Chat is an interactive platform to provide visually explained assessment of student answers.
Users can input questions and student answers to obtain automated, explainable assessment results from large language models.
- Score: 12.970776782360366
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generating rationales that justify scoring decisions has emerged as a promising approach to enhance explainability in the development of automated scoring systems. However, the scarcity of publicly available rationale data and the high cost of annotation have resulted in existing methods typically relying on noisy rationales generated by large language models (LLMs). To address these challenges, we have developed AERA Chat, an interactive platform, to provide visually explained assessment of student answers and streamline the verification of rationales. Users can input questions and student answers to obtain automated, explainable assessment results from LLMs. The platform's innovative visualization features and robust evaluation tools make it useful for educators to assist their marking process, and for researchers to evaluate assessment performance and quality of rationales generated by different LLMs, or as a tool for efficient annotation. We evaluated three rationale generation approaches on our platform to demonstrate its capability.
Related papers
- Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses [0.0]
This study aims to explore the potential of Large Language Models (LLMs) in facilitating automated feedback in math education.
We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems.
We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers.
arXiv Detail & Related papers (2024-10-29T16:57:45Z) - An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need of human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Digital Socrates: Evaluating LLMs through Explanation Critiques [37.25959112212333]
Digital Socrates is an open-source, automatic critique model for model explanations.
We show how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains.
arXiv Detail & Related papers (2023-11-16T06:51:46Z) - Empowering Private Tutoring by Chaining Large Language Models [87.76985829144834]
This work explores the development of a full-fledged intelligent tutoring system powered by state-of-the-art large language models (LLMs)
The system is into three inter-connected core processes-interaction, reflection, and reaction.
Each process is implemented by chaining LLM-powered tools along with dynamically updated memory modules.
arXiv Detail & Related papers (2023-09-15T02:42:03Z) - Distilling ChatGPT for Explainable Automated Student Answer Assessment [19.604476650824516]
We introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation.
Our experiments show that the proposed method improves the overall QWK score by 11% compared to ChatGPT.
arXiv Detail & Related papers (2023-05-22T12:11:39Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROS, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automatize the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.