Beyond Static Scoring: Enhancing Assessment Validity via AI-Generated Interactive Verification
- URL: http://arxiv.org/abs/2512.12592v1
- Date: Sun, 14 Dec 2025 08:13:53 GMT
- Title: Beyond Static Scoring: Enhancing Assessment Validity via AI-Generated Interactive Verification
- Authors: Tom Lee, Sihoon Lee, Seonghun Kim
- Abstract summary: Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship. This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions.
- Score: 0.4260312058817663
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship. While recent research has focused on the accuracy of automated scoring (AES), these static approaches fail to capture process evidence or verify genuine student understanding. This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions. In a pilot study with university instructors (N=9), we demonstrate that while Stage 1 (Auto-Scoring) ensures procedural fairness and consistency, Stage 2 (Interactive Verification) is essential for construct validity, effectively diagnosing superficial reasoning or unverified AI use. We report on the system's design, instructor perceptions of fairness versus validity, and the necessity of adaptive difficulty in follow-up questioning. The findings offer a scalable pathway for authentic assessment that moves beyond policing AI to integrating it as a synergistic partner in the evaluation process.
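To make the two-stage flow concrete, the sketch below illustrates one possible reading of the framework: Stage 1 scores an answer against a rubric criterion, and Stage 2 turns the weakness identified in Stage 1 into a targeted follow-up question whose difficulty adapts to the static score. All names (RubricCriterion, call_llm, score_answer, generate_followup) and the prompt wording are illustrative assumptions, not the authors' implementation; call_llm is a stub standing in for whatever LLM client is actually used.

```python
"""Minimal sketch of the two-stage assessment flow described in the abstract.

Stage 1: rubric-based automated scoring of a student's open-ended answer.
Stage 2: AI-generated, targeted follow-up question to verify understanding.

All names and prompts are illustrative assumptions, not the paper's code.
"""
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    name: str          # e.g. "explains the concept with an example"
    max_points: int


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned response for the demo."""
    return "3 | The answer states the definition but gives no worked example."


def score_answer(answer: str, criterion: RubricCriterion) -> tuple[int, str]:
    """Stage 1: ask the model for a rubric score plus a one-sentence justification."""
    prompt = (
        f"Rubric criterion: {criterion.name} (0-{criterion.max_points} points).\n"
        f"Student answer:\n{answer}\n"
        "Reply as '<score> | <one-sentence justification>'."
    )
    raw = call_llm(prompt)
    score_text, justification = raw.split("|", 1)
    return min(int(score_text.strip()), criterion.max_points), justification.strip()


def generate_followup(answer: str, justification: str, difficulty: str) -> str:
    """Stage 2: probe the weakness found in Stage 1 at the chosen difficulty."""
    prompt = (
        f"Student answer:\n{answer}\n"
        f"Identified weakness: {justification}\n"
        f"Write one {difficulty} follow-up question that the student can only "
        "answer if they genuinely understand their own submission."
    )
    return call_llm(prompt)


if __name__ == "__main__":
    criterion = RubricCriterion("explains the concept with an example", max_points=5)
    answer = "Gradient descent minimises a loss by stepping against the gradient."
    points, why = score_answer(answer, criterion)
    # Adaptive difficulty: probe harder when the static score looks strong.
    followup = generate_followup(answer, why, "hard" if points >= 4 else "basic")
    print(points, why, followup, sep="\n")
```

In this reading, the adaptive-difficulty requirement reported by instructors is handled by conditioning the follow-up prompt on the Stage 1 score, so a strong static score triggers a more demanding verification question.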
Related papers
- TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents [51.30998248590416]
Trajectory-Aware Comprehensive Evaluation (TRACE) is a framework that holistically assesses the entire problem-solving trajectory. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity.
arXiv Detail & Related papers (2026-02-05T13:28:57Z) - A Unified XAI-LLM Approach for Endotracheal Suctioning Activity Recognition [0.1794226570005898]
This study proposes a unified framework for video-based activity recognition benchmarked against conventional machine learning and deep learning approaches. Within this framework, the Large Language Model (LLM) serves as the central reasoning module, performing both temporal activity recognition and explainable decision analysis from video data. Experimental results demonstrate that the proposed LLM-based approach outperforms baseline models, achieving an improvement of approximately 15-20% in both accuracy and F1 score.
arXiv Detail & Related papers (2026-01-29T14:46:48Z) - Designing AI-Resilient Assessments Using Interconnected Problems: A Theoretically Grounded and Empirically Validated Framework [0.0]
The rapid adoption of generative AI has undermined traditional modular assessments in computing education. This paper presents a theoretically grounded framework for designing AI-resilient assessments.
arXiv Detail & Related papers (2025-12-11T15:53:19Z) - Assessment Twins: A Protocol for AI-Vulnerable Summative Assessment [0.0]
We introduce assessment twins as an accessible approach for redesigning assessment tasks to enhance validity. We use Messick's unified validity framework to systematically map the ways in which GenAI threatens content, structural, consequential, generalisability, and external validity. We argue that the twin approach helps mitigate validity threats by triangulating evidence across complementary formats.
arXiv Detail & Related papers (2025-10-03T12:05:34Z) - RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark [71.3555284685426]
We introduce RealUnify, a benchmark designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. We find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient.
arXiv Detail & Related papers (2025-09-29T15:07:28Z) - AIssistant: An Agentic Approach for Human--AI Collaborative Scientific Work on Reviews and Perspectives in Machine Learning [2.464267718050055]
We present the first experiments with AIssistant for perspective and review research papers in machine learning. Our system integrates modular tools and agents for literature, section-wise experimentation, citation management, and automatic paper text generation. Despite its effectiveness, we identify key limitations, including hallucinated citations, difficulty adapting to dynamic paper structures, and incomplete integration of multimodal content.
arXiv Detail & Related papers (2025-09-14T15:50:31Z) - CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection [60.52240468810558]
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews. We also develop CoCoDet, an AI review detector via a multi-task learning framework, to achieve more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z) - Breaking Barriers in Software Testing: The Power of AI-Driven Automation [0.0]
This paper presents an AI-driven framework that automates test case generation and validation using natural language processing (NLP), reinforcement learning (RL), and predictive models, embedded within a policy-driven trust and fairness model. Case studies demonstrate measurable gains in defect detection, reduced testing effort, and faster release cycles, showing that AI-enhanced testing improves both efficiency and reliability.
arXiv Detail & Related papers (2025-08-22T01:04:50Z) - Beyond Detection: Designing AI-Resilient Assessments with Automated Feedback Tool to Foster Critical Thinking [0.0]
This research proposes a proactive, AI-resilient solution based on assessment design rather than detection. It introduces a web-based Python tool that integrates Bloom's taxonomy with advanced natural language processing techniques. It helps educators determine whether a task targets lower-order thinking such as recall and summarization or higher-order skills such as analysis, evaluation, and creation.
arXiv Detail & Related papers (2025-03-30T23:13:00Z) - Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z) - Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automatize the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)