A Comprehensive Dataset for Human vs. AI Generated Text Detection
- URL: http://arxiv.org/abs/2510.22874v1
- Date: Sun, 26 Oct 2025 23:50:52 GMT
- Title: A Comprehensive Dataset for Human vs. AI Generated Text Detection
- Authors: Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amitava Das
- Abstract summary: We present a comprehensive dataset comprising over 58,000 text samples from authentic New York Times articles. The dataset provides original article abstracts as prompts, full human-authored narratives, and their AI-generated counterparts. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text with an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%.
- Score: 23.0218614564443
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, full human-authored narratives, and their AI-generated counterparts. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/gsingh1-py/train.
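The binary detection task described above can be sketched with a minimal toy baseline. The real dataset lives on Hugging Face (gsingh1-py/train), but its column names are not given in this listing, so the example below uses hand-written strings and a single lexical feature (type-token ratio) with a nearest-centroid rule; it illustrates the task setup only, not the paper's actual baseline.

```python
# Toy sketch of human-vs-AI binary text classification.
# Feature choice (type-token ratio) and classifier (nearest centroid)
# are illustrative assumptions, not the paper's method.
from statistics import mean

def type_token_ratio(text: str) -> float:
    """Fraction of distinct words; a crude lexical-diversity signal."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def fit_centroids(samples):
    """samples: list of (text, label) pairs, label in {'human', 'ai'}."""
    by_label = {"human": [], "ai": []}
    for text, label in samples:
        by_label[label].append(type_token_ratio(text))
    return {label: mean(vals) for label, vals in by_label.items()}

def predict(centroids, text):
    """Assign the label whose centroid is nearest in feature space."""
    score = type_token_ratio(text)
    return min(centroids, key=lambda lab: abs(centroids[lab] - score))

train = [
    ("the quick brown fox jumps over the lazy dog near the river", "human"),
    ("the the system the system generates the the output output", "ai"),
]
centroids = fit_centroids(train)
print(predict(centroids, "a quick brown fox ran past the old dog"))
```

A real baseline on this dataset would swap the single feature for learned representations (e.g. a fine-tuned transformer) and train on the full 58,000-sample corpus.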
Related papers
- ChatGpt Content detection: A new approach using xlm-roberta alignment [0.0]
We present a comprehensive methodology to detect AI-generated text using XLM-RoBERTa, a state-of-the-art multilingual transformer model. We fine-tuned the model on a balanced dataset of human and AI-generated texts and evaluated its performance. Our findings offer a valuable tool for maintaining academic integrity and contribute to the broader field of AI ethics.
arXiv Detail & Related papers (2025-11-26T03:16:57Z) - Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection [71.59834293521074]
We develop a framework to distinguish between human-authored and machine-generated text. Our method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Code, pretrained weights, and a demo will be released.
arXiv Detail & Related papers (2025-10-07T08:14:45Z) - LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text [39.58172554437255]
We introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models.
arXiv Detail & Related papers (2025-09-25T14:59:43Z) - Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
This study systematically evaluates twelve state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset. Our findings reveal that detectors frequently flag even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - RU-AI: A Large Multimodal Dataset for Machine-Generated Content Detection [11.265512559447986]
We introduce RU-AI, a new large-scale multimodal dataset for robust and effective detection of machine-generated content in text, image, and voice. Our dataset is constructed on the basis of three large publicly available datasets: Flickr8K, COCO, and Places205. The results reveal that existing models still struggle to achieve accurate and robust detection on our dataset.
arXiv Detail & Related papers (2024-06-07T12:58:14Z) - StyloAI: Distinguishing AI-Generated Content with Stylometric Analysis [0.0]
This study proposes StyloAI, a data-driven model that uses 31 stylometric features to identify AI-generated texts.
StyloAI achieves accuracy rates of 81% and 98% on the test set of the AuTextification dataset and the Education dataset, respectively.
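StyloAI's 31 stylometric features are not enumerated in the summary above, so the following stdlib sketch computes three generic features commonly used in stylometric detection work (average sentence length, average word length, punctuation rate). The feature set here is an illustrative assumption, not the paper's actual one.

```python
# Hedged sketch of stylometric feature extraction. These three features
# are generic examples; StyloAI's real 31-feature set is not listed here.
import re

def stylometric_features(text: str) -> dict:
    # Split on sentence-final punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    punct = re.findall(r"[^\w\s]", text)
    n_words = len(words) or 1
    return {
        "avg_sentence_len": n_words / (len(sentences) or 1),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "punct_rate": len(punct) / n_words,
    }

feats = stylometric_features("It rained. We stayed in, reading quietly!")
print(feats)
```

In a full pipeline, such feature vectors would feed a standard classifier (e.g. random forest or logistic regression) trained on labeled human and AI texts.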
arXiv Detail & Related papers (2024-05-16T14:28:01Z) - RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic
Features for Distinguishing AI-Generated and Human-Written Texts [0.8437187555622164]
This article investigates the problem of AI-generated text detection from two different aspects: semantics and syntax.
We present an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks.
arXiv Detail & Related papers (2024-02-19T00:40:17Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - Paraphrasing evades detectors of AI-generated text, but retrieval is an
effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z) - A Benchmark Corpus for the Detection of Automatically Generated Text in
Academic Publications [0.02578242050187029]
This paper presents two datasets composed of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.