Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
- URL: http://arxiv.org/abs/2603.02353v2
- Date: Wed, 04 Mar 2026 18:35:26 GMT
- Title: Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs
- Authors: Jiangang Hao,
- Abstract summary: Writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.
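The cross-LLM generalization experiment described in the abstract can be illustrated with a minimal sketch: train a detector on essays from one model and evaluate it on essays produced by a different model. The features (average sentence length, type-token ratio), the nearest-centroid classifier, and the toy corpora below are illustrative placeholders, not the chapter's actual method:

```python
import math

def features(text):
    # Two toy stylometric features: average sentence length (words per
    # sentence) and type-token ratio. Real detectors use far richer
    # feature sets or fine-tuned transformers; these are stand-ins.
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.lower().split()
    avg_sentence_len = len(words) / max(len(sentences), 1)
    type_token_ratio = len(set(words)) / max(len(words), 1)
    return (avg_sentence_len, type_token_ratio)

def centroid(vectors):
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))

def detect(essay, human_centroid, ai_centroid):
    # Nearest-centroid decision rule over the feature space.
    vec = features(essay)
    if math.dist(vec, ai_centroid) < math.dist(vec, human_centroid):
        return "ai"
    return "human"

# Toy training data: human essays (short, repetitive sentences) versus
# essays from a hypothetical "LLM A" (long, lexically diverse sentences).
human_essays = [
    "I like my town. The town is small. I like the park. The park is nice.",
    "My dog runs fast. He runs in the yard. The yard is big. I love my dog.",
]
llm_a_essays = [
    "Urban environments foster remarkable social cohesion through thoughtfully designed communal spaces and accessible infrastructure.",
    "Technological innovation fundamentally reshapes contemporary educational paradigms by enabling personalized adaptive learning experiences.",
]

human_c = centroid([features(e) for e in human_essays])
ai_c = centroid([features(e) for e in llm_a_essays])

# Cross-LLM evaluation: an essay from a different model, "LLM B".
llm_b_essay = ("Globalization profoundly transforms cultural identities by "
               "interweaving diverse traditions economies and perspectives "
               "into increasingly interconnected societies.")
print(detect(llm_b_essay, human_c, ai_c))  # prints: ai
```

In this toy setting the detector trained against LLM A still flags the LLM B essay, because both models share stylometric tendencies the classifier picked up; the chapter's empirical analyses probe exactly how far that transfer holds in practice.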
Related papers
- Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review [53.99984738447279]
Recent work frames this task as automatic text generation, underusing author expertise and intent. We introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement. To support this formulation, we construct Re³Align, the first large-scale dataset of aligned review-response-revision triplets.
arXiv Detail & Related papers (2026-01-19T14:07:10Z) - Exposía: Academic Writing Assessment of Exposés and Peer Feedback [56.428320613219306]
We present Exposía, the first public dataset that connects writing and feedback assessment in higher education. We use Exposía to benchmark state-of-the-art open-source large language models (LLMs) on two tasks: automated scoring of (1) the proposals and (2) the student reviews.
arXiv Detail & Related papers (2026-01-10T11:33:26Z) - Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents [2.825140278227664]
Formative feedback is one of the most effective drivers of student learning. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents to score learner reflections.
arXiv Detail & Related papers (2025-11-14T09:46:21Z) - CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection [60.52240468810558]
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews. We also develop CoCoDet, an AI review detector based on a multi-task learning framework, for more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z) - A Multi-Task Evaluation of LLMs' Processing of Academic Text Input [6.654906601143054]
How much large language models (LLMs) can aid scientific discovery, notably in academic peer review, is under heated debate. We organize individual tasks that computer science studies treat separately into a guided and robust workflow for evaluating LLMs' processing of academic text input. We employ four tasks in the assessment: content reproduction, comparison, scoring, and reflection, each demanding a specific role of the LLM.
arXiv Detail & Related papers (2025-08-15T19:05:57Z) - Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection [44.05134959039957]
We investigate how sociolinguistic attributes (gender, CEFR proficiency, academic field, and language environment) impact state-of-the-art AI text detectors. Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects. These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups.
arXiv Detail & Related papers (2025-02-18T07:49:31Z) - AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity [13.371946973050845]
We examine and benchmark the characteristics and quality of essays generated by popular large language models (LLMs). Our findings highlight limitations in existing automated scoring systems and identify areas for improvement. Despite concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated by one model can often identify texts from others with high accuracy.
arXiv Detail & Related papers (2024-10-22T21:30:58Z) - Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z) - Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs [49.18567856499736]
We investigate whether large language models (LLMs) can support open-ended dialogue tutoring. We apply a range of knowledge tracing (KT) methods to the resulting labeled data to track student knowledge levels over an entire dialogue. We conduct experiments on two tutoring dialogue datasets and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues.
arXiv Detail & Related papers (2024-09-24T22:31:39Z) - Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents [80.5213198675411]
Large language models (LLMs) have dramatically enhanced the field of language intelligence.
LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer.
Recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents.
arXiv Detail & Related papers (2023-11-20T14:30:55Z) - Creativity Support in the Age of Large Language Models: An Empirical Study Involving Emerging Writers [33.3564201174124]
We investigate the utility of modern large language models in assisting professional writers via an empirical user study.
We find that while writers seek LLMs' help across all three types of cognitive activities, they find LLMs more helpful in translation and reviewing.
arXiv Detail & Related papers (2023-09-22T01:49:36Z) - Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple, yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs).
Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process.
We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
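As summarized above, Re2 has the model process the question twice before answering. A minimal sketch of such a re-reading prompt template follows; the exact phrasing used in the paper may differ, and the trigger phrase is an assumption borrowed from common CoT prompting:

```python
def re2_prompt(question: str, trigger: str = "Let's think step by step.") -> str:
    # Re-reading: present the input twice before eliciting the answer,
    # so the model's attention passes over the question a second time.
    return f"Q: {question}\nRead the question again: {question}\nA: {trigger}"

prompt = re2_prompt("If a pen costs 2 dollars, how much do 3 pens cost?")
print(prompt)
```

Unlike CoT, which shapes the output, this template only restructures the input, so it composes with any downstream answer-eliciting strategy.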
arXiv Detail & Related papers (2023-09-12T14:36:23Z) - Structural Pre-training for Dialogue Comprehension [51.215629336320305]
We present SPIDER, Structural Pre-traIned DialoguE Reader, to capture dialogue exclusive features.
To simulate the dialogue-like features, we propose two training objectives in addition to the original LM objectives.
Experimental results on widely used dialogue benchmarks verify the effectiveness of the newly introduced self-supervised tasks.
arXiv Detail & Related papers (2021-05-23T15:16:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or information and is not responsible for any consequences of their use.