Automated Multiple Mini Interview (MMI) Scoring
- URL: http://arxiv.org/abs/2602.02360v1
- Date: Mon, 02 Feb 2026 17:20:25 GMT
- Title: Automated Multiple Mini Interview (MMI) Scoring
- Authors: Ryan Huynh, Frank Guerin, Alison Callwood,
- Abstract summary: We show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Mini-Interviews.<n>We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring.
- Score: 5.277507079014855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.
Related papers
- Optimizing In-Context Demonstrations for LLM-based Automated Grading [31.353360036776976]
GUIDE (Grading Using Iteratively Designed Exemplars) is a framework that reframes exemplar selection and refinement as a boundary-focused optimization problem.<n>We show that GUIDE significantly outperforms standard retrieval baselines in experiments in physics, chemistry, and pedagogical content knowledge.
arXiv Detail & Related papers (2026-02-28T04:52:38Z) - Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays [15.895792302323883]
In educational contexts, teachers and learners require interpretable, trait-level feedback.<n>We study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms.<n>We show that explicitly modeling score ordinality substantially improves agreement with human raters.
arXiv Detail & Related papers (2026-02-04T14:30:52Z) - Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy [28.293009223912602]
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall.<n>This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment.<n>We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's taxonomy.
arXiv Detail & Related papers (2026-01-28T05:01:11Z) - OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series [36.88936933010042]
OutboundEval is a comprehensive benchmark for evaluating large language models (LLMs) in intelligent outbound calling scenarios.<n>We design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics.<n>We introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality.
arXiv Detail & Related papers (2025-10-24T08:27:58Z) - AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition [27.312190686305588]
Large language models (LLMs) have shown strong potential in automated scoring.<n>Their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment.<n>We propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition.
arXiv Detail & Related papers (2025-09-26T05:45:14Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks [81.44256822500257]
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences.<n> RLHF exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks.<n>We propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities.
arXiv Detail & Related papers (2025-05-19T08:33:11Z) - Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process.<n>Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits.<n> Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - Leveraging LLMs for Dialogue Quality Measurement [27.046917937460798]
Large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks.
Manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures.
Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.
arXiv Detail & Related papers (2024-06-25T06:19:47Z) - SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation [23.203761925540736]
We propose a novel framework SLIDE (Small and Large Integrated for Dialogue Evaluation)
Our approach achieves state-of-the-art performance in both the classification and evaluation tasks, and additionally the SLIDE exhibits better correlation with human evaluators.
arXiv Detail & Related papers (2024-05-24T20:32:49Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - RADDLE: An Evaluation Benchmark and Analysis Platform for Robust
Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.