AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity
- URL: http://arxiv.org/abs/2410.17439v4
- Date: Thu, 16 Oct 2025 01:36:57 GMT
- Title: AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity
- Authors: Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang,
- Abstract summary: We examine and benchmark the characteristics and quality of essays generated by popular large language models (LLMs)<n>Our findings highlight limitations in existing automated scoring systems, and identify areas for improvement.<n>Despite concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy.
- Score: 13.371946973050845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems, such as e-rater, when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.
Related papers
- Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs [0.0]
Writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning.<n>The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays.<n>This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use.
arXiv Detail & Related papers (2026-03-02T19:51:01Z) - Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection [71.59834293521074]
We develop a framework to distinguish between human-authored and machine-generated text.<n>Our method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset.<n>Code, pretrained weights, and demo will be released.
arXiv Detail & Related papers (2025-10-07T08:14:45Z) - CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection [60.52240468810558]
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews.<n>We also develop CoCoDet, an AI review detector via a multi-task learning framework, to achieve more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority.<n>We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - Automatic Legal Writing Evaluation of LLMs [10.74636407144071]
oab-bench is a benchmark comprising 105 questions across seven areas of law from recent editions of the exam.
Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams.
Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams.
arXiv Detail & Related papers (2025-04-29T22:16:39Z) - Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios.
Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models.
We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z) - Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection [44.05134959039957]
We investigate how sociolinguistic attributes-gender, CEFR proficiency, academic field, and language environment-impact state-of-the-art AI text detectors.<n>Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects.<n>These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups.
arXiv Detail & Related papers (2025-02-18T07:49:31Z) - Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications [0.0]
generative AI is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring.<n>We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI.
arXiv Detail & Related papers (2025-01-04T16:59:29Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks.
We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLMs evaluation aligns closely with human evaluation, while widely-used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z) - Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications [26.857056013032263]
evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications.
Our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications.
arXiv Detail & Related papers (2024-04-10T15:46:08Z) - Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction [14.822205658480813]
Large Language Models (LLMs) have been reported to outperform existing automatic evaluation metrics in some tasks.
This study investigates the performance of LLMs in grammatical error correction (GEC) evaluation by employing prompts inspired by previous research.
arXiv Detail & Related papers (2024-03-26T09:43:15Z) - Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated
Student Essay Detection [29.433764586753956]
Large language models (LLMs) have exhibited remarkable capabilities in text generation tasks.
The utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises.
This paper aims to bridge this gap by constructing AIG-ASAP, an AI-generated student essay dataset.
arXiv Detail & Related papers (2024-02-01T08:11:56Z) - DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and
Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of Large Language Models (LLMs) generated texts.
We introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score.
Our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z) - Towards LLM-based Autograding for Short Textual Answers [4.853810201626855]
This manuscript is an evaluation of a large language model for the purpose of autograding.
Our findings suggest that while "out-of-the-box" LLMs provide a valuable tool, their readiness for independent automated grading remains a work in progress.
arXiv Detail & Related papers (2023-09-09T22:25:56Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including oBERTa-Large/Base-Detector, GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.