Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing
- URL: http://arxiv.org/abs/2506.03501v1
- Date: Wed, 04 Jun 2025 02:31:36 GMT
- Title: Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing
- Authors: Yuchen Guo, Zhicheng Dou, Huy H. Nguyen, Ching-Chun Chang, Saku Sugawara, Isao Echizen
- Abstract summary: A survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in content generation even though human-machine collaboration is becoming mainstream. We propose using BERTScore as a metric to measure human involvement in the generation process, together with a multi-task RoBERTa-based regressor trained on a token classification task, to address this problem.
- Score: 39.5254201243129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in the generation of content even though human-machine collaboration is becoming mainstream. Besides generating entire texts, people may use machines to complete or revise texts. Such human involvement varies case by case, which makes binary classification a less than satisfactory approach. We refer to this situation as participation detection obfuscation. We propose using BERTScore as a metric to measure human involvement in the generation process and a multi-task RoBERTa-based regressor trained on a token classification task to address this problem. To evaluate the effectiveness of this approach, we simulated academic-based scenarios and created a continuous dataset reflecting various levels of human involvement. All of the existing detectors we examined failed to detect the level of human involvement on this dataset. Our method, however, succeeded (F1 score of 0.9423 and a regressor mean squared error of 0.004). Moreover, it demonstrated some generalizability across generative models. Our code is available at https://github.com/gyc-nii/CAS-CS-and-dual-head-detector
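The measurement idea at the heart of the abstract, scoring a machine-assisted text against a human-written reference with BERTScore, can be illustrated with a minimal sketch. The code below implements the BERTScore precision/recall/F1 computation directly over toy token embeddings with NumPy rather than calling a pretrained BERT model; the function name `bertscore_f1` and the random embeddings are illustrative assumptions, not the authors' released code (see their repository for the actual implementation).

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style F1 over token embedding matrices.

    cand_emb: (num_candidate_tokens, dim) embeddings of the text under test
    ref_emb:  (num_reference_tokens, dim) embeddings of the reference text
    """
    # Normalize rows so dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # (num_cand, num_ref) similarity matrix
    precision = sim.max(axis=1).mean()      # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()         # each reference token -> best candidate match
    return 2 * precision * recall / (precision + recall)

# Toy check: identical embeddings give F1 = 1.0, the limiting case where the
# candidate text matches the human reference exactly.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
print(round(float(bertscore_f1(emb, emb)), 4))  # -> 1.0
```

In the paper's framing, a continuous score like this (rather than a binary human/AI label) is what lets the detector express intermediate levels of human involvement.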
Related papers
- mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection [0.0]
Automated detection can assist humans in identifying machine-generated texts. This notebook describes the mdok approach to robust detection, based on fine-tuning smaller LLMs for text classification. It is applied to both subtasks of Voight-Kampff Generative AI Detection 2025.
arXiv Detail & Related papers (2025-06-02T14:07:32Z)
- Beyond human subjectivity and error: a novel AI grading system [67.410870290301]
The grading of open-ended questions is a high-effort, high-impact task in education.
Recent breakthroughs in AI technology might facilitate such automation, but this has not been demonstrated at scale.
We introduce a novel automatic short answer grading (ASAG) system.
arXiv Detail & Related papers (2024-05-07T13:49:59Z)
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work [0.38850145898707145]
Large language models (LLMs) and generative AI tools present challenges in identifying and addressing biases.
Generative AI tools are susceptible to goal misgeneralization, hallucinations, and adversarial attacks such as red-teaming prompts.
We find that incorporating generative AI in the process of writing research manuscripts introduces a new type of context-induced algorithmic bias.
arXiv Detail & Related papers (2023-12-04T04:05:04Z)
- Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education [10.606131520965604]
This study investigates AI content detection in a rarely explored yet realistic setting.
We first formalized the detection task as identifying the transition points between human-written content and AI-generated content.
We then proposed a two-step approach where we separated AI-generated content from human-written content during the encoder training process.
arXiv Detail & Related papers (2023-07-23T08:47:51Z)
- Distinguishing Human Generated Text From ChatGPT Generated Text Using Machine Learning [0.251657752676152]
This paper presents a machine learning-based solution that can distinguish ChatGPT-generated text from human-written text.
We tested the proposed model on a Kaggle dataset consisting of 10,000 texts, of which 5,204 were written by humans and collected from news and social media.
On the corpus generated by GPT-3.5, the proposed algorithm presents an accuracy of 77%.
arXiv Detail & Related papers (2023-05-26T09:27:43Z)
- Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text [23.622347443796183]
We study a more realistic setting where text begins as human-written and transitions to being generated by state-of-the-art neural language models.
We show that, while annotators often struggle at this task, there is substantial variance in annotator skill and that given proper incentives, annotators can improve at this task over time.
arXiv Detail & Related papers (2022-12-24T06:40:25Z)
- SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes [93.19166902594168]
We propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation.
The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus.
We evaluate SESCORE2 and previous methods on four text generation tasks across three languages.
arXiv Detail & Related papers (2022-12-19T09:02:16Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the scores produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.