Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments
- URL: http://arxiv.org/abs/2412.02713v2
- Date: Thu, 12 Dec 2024 13:28:20 GMT
- Title: Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments
- Authors: Alona Strugatski, Giora Alexandron
- Abstract summary: Despite the widespread use of multiple-choice questions in assessments, the detection of AI cheating has been almost unexplored. We propose a method based on the application of Item Response Theory to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generative AI is transforming the educational landscape, raising significant concerns about cheating. Despite the widespread use of multiple-choice questions in assessments, the detection of AI cheating in MCQ-based tests has been almost unexplored, in contrast to the focus on detecting AI cheating in text-rich student outputs. In this paper, we propose a method based on the application of Item Response Theory (IRT) to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns, with AI cheating manifesting as deviations from the expected patterns of human responses. These deviations are modeled using Person-Fit Statistics. We demonstrate that this method effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), and that it is also sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
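The deviations the abstract refers to are typically quantified with person-fit statistics such as the standardized log-likelihood statistic lz. The sketch below is an illustration of that general idea under a Rasch model, not the paper's actual implementation; the item difficulties and response vectors are hypothetical. Strongly negative lz values flag response patterns (e.g., failing easy items while solving hard ones) that are unlikely for a human test-taker at the estimated ability level.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def lz_person_fit(responses, difficulties, theta):
    """Standardized log-likelihood person-fit statistic lz.

    lz = (l0 - E[l0]) / sqrt(Var[l0]), where l0 is the log-likelihood
    of the observed 0/1 response vector given ability theta.
    Large negative values indicate aberrant (model-misfitting) patterns.
    """
    l0 = exp_l = var_l = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        q = 1.0 - p
        l0 += x * math.log(p) + (1 - x) * math.log(q)
        exp_l += p * math.log(p) + q * math.log(q)
        var_l += p * q * math.log(p / q) ** 2
    return (l0 - exp_l) / math.sqrt(var_l)

# Hypothetical items ordered from easy to hard.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
typical = [1, 1, 1, 0, 0]    # correct on easy items, wrong on hard ones
aberrant = [0, 0, 1, 1, 1]   # reversed pattern: a candidate misfit signal

print(lz_person_fit(typical, difficulties, theta=0.0))   # near zero: fits
print(lz_person_fit(aberrant, difficulties, theta=0.0))  # strongly negative
```

In a detection setting, lz would be computed per examinee from calibrated item parameters, and unusually negative values would be candidates for further review rather than automatic accusations.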
Related papers
- On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities.
We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks.
arXiv Detail & Related papers (2025-02-27T20:21:36Z) - Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content.
We systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset.
Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z) - The Imitation Game for Educational AI [23.71250100390303]
We present a novel evaluation framework based on a two-phase Turing-like test.
In Phase 1, students provide open-ended responses to questions, revealing natural misconceptions.
In Phase 2, both AI and human experts, conditioned on each student's specific mistakes, generate distractors for new related questions.
arXiv Detail & Related papers (2025-02-21T01:14:55Z) - DAMAGE: Detecting Adversarially Modified AI Generated Text [0.13108652488669736]
We show that many existing AI detectors fail to detect humanized text.
We demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate.
arXiv Detail & Related papers (2025-01-06T23:43:49Z) - Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications [0.0]
Generative AI is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring.
We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI.
arXiv Detail & Related papers (2025-01-04T16:59:29Z) - Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA [43.116608441891096]
Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning.
State-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval.
arXiv Detail & Related papers (2024-10-09T03:53:26Z) - Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation [48.70176791365903]
This study explores how bias shapes the perception of AI versus human generated content.
We investigated how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z) - Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z) - Towards ethical multimodal systems [10.518452055021488]
The emerging field of AI alignment aims to make AI systems reflect human values.
This paper focuses on evaluating the ethics of multimodal AI systems involving both text and images.
arXiv Detail & Related papers (2023-04-26T18:11:33Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including the RoBERTa-Large/Base detectors and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications.
The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use.
We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z) - Metaethical Perspectives on 'Benchmarking' AI Ethics [81.65697003067841]
Benchmarks are seen as the cornerstone for measuring technical progress in Artificial Intelligence (AI) research.
An increasingly prominent research area in AI is ethics, which currently has no set of benchmarks nor commonly accepted way for measuring the 'ethicality' of an AI system.
We argue that it makes more sense to talk about 'values' rather than 'ethics' when considering the possible actions of present and future AI systems.
arXiv Detail & Related papers (2022-04-11T14:36:39Z) - A Turing Test for Transparency [0.0]
A central goal of explainable artificial intelligence (XAI) is to improve the trust relationship in human-AI interaction.
Recent empirical evidence shows that explanations can have the opposite effect.
This effect challenges the very goal of XAI and implies that responsible usage of transparent AI methods has to consider the ability of humans to distinguish machine generated from human explanations.
arXiv Detail & Related papers (2021-06-21T20:09:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.