Assessment Twins: A Protocol for AI-Vulnerable Summative Assessment
- URL: http://arxiv.org/abs/2510.02929v1
- Date: Fri, 03 Oct 2025 12:05:34 GMT
- Title: Assessment Twins: A Protocol for AI-Vulnerable Summative Assessment
- Authors: Jasper Roe, Mike Perkins, Louie Giray
- Abstract summary: We introduce assessment twins as an accessible approach for redesigning assessment tasks to enhance validity. We use Messick's unified validity framework to systematically map the ways in which GenAI threatens content, structural, consequential, generalisability, and external validity. We argue that the twin approach helps mitigate validity threats by triangulating evidence across complementary formats.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generative Artificial Intelligence (GenAI) is reshaping higher education and raising pressing concerns about the integrity and validity of higher education assessment. While assessment redesign is increasingly seen as a necessity, there is a relative lack of literature detailing what such redesign may entail. In this paper, we introduce assessment twins as an accessible approach for redesigning assessment tasks to enhance validity. We use Messick's unified validity framework to systematically map the ways in which GenAI threatens content, structural, consequential, generalisability, and external validity. Following this, we define assessment twins as two deliberately linked components that address the same learning outcomes through different modes of evidence, scheduled closely together to allow for cross-verification and assurance of learning. We argue that the twin approach helps mitigate validity threats by triangulating evidence across complementary formats, such as pairing essays with oral defences, group discussions, or practical demonstrations. We highlight several advantages: preservation of established assessment formats, reduction of reliance on surveillance technologies, and flexible use across cohort sizes. To guide implementation, we propose a four-step design process: identifying vulnerabilities, aligning outcomes, selecting complementary tasks, and developing interdependent marking schemes. We also acknowledge the challenges, including resource intensity, equity concerns, and the need for empirical validation. Nonetheless, we contend that assessment twins represent a validity-focused response to GenAI that prioritises pedagogy while supporting meaningful student learning outcomes.
Related papers
- Transforming GenAI Policy to Prompting Instruction: An RCT of Scalable Prompting Interventions in a CS1 Course [8.222598094097867]
We conducted a semester-long RCT with four ICAP framework-based instructional conditions varying in engagement intensity, with a pre-test, immediate and delayed post-tests, and surveys. We found that all conditions significantly improved prompting skills, with gains increasing progressively from Condition 1 to Condition 4. For students with similar pre-test scores, higher learning gains on the immediate post-test predicted higher final exam scores, though no direct between-group differences emerged.
arXiv Detail & Related papers (2026-02-17T21:40:12Z) - CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models [55.0103764229311]
We propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios.
arXiv Detail & Related papers (2026-02-05T13:13:19Z) - ChatGPT and Gemini participated in the Korean College Scholastic Ability Test -- Earth Science I [0.0]
This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs). Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. By exploiting AI's weaknesses, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
arXiv Detail & Related papers (2025-12-17T10:46:41Z) - Beyond Static Scoring: Enhancing Assessment Validity via AI-Generated Interactive Verification [0.4260312058817663]
Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship. This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions.
arXiv Detail & Related papers (2025-12-14T08:13:53Z) - Designing AI-Resilient Assessments Using Interconnected Problems: A Theoretically Grounded and Empirically Validated Framework [0.0]
The rapid adoption of generative AI has undermined traditional modular assessments in computing education. This paper presents a theoretically grounded framework for designing AI-resilient assessments.
arXiv Detail & Related papers (2025-12-11T15:53:19Z) - Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification [52.63017280231648]
Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. Person ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions. We propose a dual-invariant defense framework composed of two main phases.
arXiv Detail & Related papers (2025-11-13T03:56:40Z) - Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots [0.38233569758620045]
This study investigates the reliability and perceived accuracy of AI-assisted assessment compared to TA-assisted assessment in evaluating student posters in design thinking education. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, with slightly higher alignment for visual communication. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insight.
arXiv Detail & Related papers (2025-10-17T07:09:21Z) - Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark [69.8473923357969]
Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. We present Uni-MMMU, a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
arXiv Detail & Related papers (2025-10-15T17:10:35Z) - RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark [71.3555284685426]
We introduce RealUnify, a benchmark designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. We find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient.
arXiv Detail & Related papers (2025-09-29T15:07:28Z) - Understanding Catastrophic Interference: On the Identifiability of Latent Representations [67.05452287233122]
Catastrophic interference, also known as catastrophic forgetting, is a fundamental challenge in machine learning. We propose a novel theoretical framework that formulates catastrophic interference as an identification problem. Our approach provides both theoretical guarantees and practical performance improvements across synthetic and benchmark datasets.
arXiv Detail & Related papers (2025-09-27T00:53:32Z) - The Imitation Game for Educational AI [23.71250100390303]
We present a novel evaluation framework based on a two-phase Turing-like test. In Phase 1, students provide open-ended responses to questions, revealing natural misconceptions. In Phase 2, both AI and human experts, conditioned on each student's specific mistakes, generate distractors for new related questions.
arXiv Detail & Related papers (2025-02-21T01:14:55Z) - The AI Assessment Scale Revisited: A Framework for Educational Assessment [0.0]
Recent developments in Generative Artificial Intelligence (GenAI) have created significant uncertainty in education. We present an updated version of the AI Assessment Scale (AIAS), a framework with two fundamental purposes.
arXiv Detail & Related papers (2024-12-12T07:44:52Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks have been proposed to improve the robustness of NER models. In this paper, we propose an adaptive teacher learning method composed of two teacher-student networks. A fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
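The temporal moving-average update described in this abstract can be sketched as a standard exponential moving average (EMA) of parameters. This is a minimal illustration only: the dict-of-floats parameter representation, the fragment granularity, and the `momentum` value are assumptions for the sketch, not details taken from the paper.

```python
# Minimal EMA sketch of a teacher updated toward a student, per fragment.
# The parameter layout and momentum=0.999 default are illustrative
# assumptions, not the paper's actual configuration.
def ema_update(teacher_params, student_params, momentum=0.999):
    """Move each teacher parameter fragment toward its student counterpart."""
    return {
        name: momentum * t + (1.0 - momentum) * student_params[name]
        for name, t in teacher_params.items()
    }

# Hypothetical two-fragment model: one update step with momentum 0.9.
teacher = {"encoder.w": 0.0, "head.b": 2.0}
student = {"encoder.w": 1.0, "head.b": 0.0}
teacher = ema_update(teacher, student, momentum=0.9)
```

With a high momentum the teacher changes slowly, which is what smooths out per-step noise in the student's predictions.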
arXiv Detail & Related papers (2022-12-13T12:14:09Z) - Variational Distillation for Multi-View Learning [104.17551354374821]
We design several variational information bottlenecks to exploit two key characteristics for multi-view representation learning.
Under rigorous theoretical guarantees, our approach enables the IB to grasp the intrinsic correlation between observations and semantic labels.
arXiv Detail & Related papers (2022-06-20T03:09:46Z) - Estimating and Improving Fairness with Adversarial Learning [65.99330614802388]
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in deep learning-based medical image analysis systems.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale publicly available skin lesion dataset.
arXiv Detail & Related papers (2021-03-07T03:10:32Z) - SoK: Certified Robustness for Deep Neural Networks [13.10665264010575]
Recent studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks.
In this paper, we systematize certifiably robust approaches and related practical and theoretical implications.
We also provide the first comprehensive benchmark on existing robustness verification and training approaches on different datasets.
arXiv Detail & Related papers (2020-09-09T07:00:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.