LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text
- URL: http://arxiv.org/abs/2509.21269v1
- Date: Thu, 25 Sep 2025 14:59:43 GMT
- Title: LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text
- Authors: Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Maksim Kuprashevich,
- Abstract summary: We introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models.
- Score: 39.58172554437255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at https://sweetdream779.github.io/LLMTrace-info/ (iitolstykh/LLMTrace).
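As a rough illustration of what character-level annotations for interval detection might look like, the sketch below defines a hypothetical record layout (the field names and structure are illustrative assumptions, not the actual LLMTrace schema) and derives the full-text binary label from the span list:

```python
# Hypothetical record for a mixed-authorship sample with character-level
# AI-span annotations. Field names are assumptions, not the LLMTrace schema.
record = {
    "text": "The results were promising. Further analysis confirms the trend.",
    "lang": "en",
    "ai_spans": [(28, 64)],  # [start, end) character offsets of AI-written segments
}

def binary_label(record):
    """Full-text label: 'ai' if any character is AI-written, else 'human'."""
    return "ai" if record["ai_spans"] else "human"

def validate_spans(record):
    """Spans must be in-bounds, ordered, and non-overlapping."""
    n, prev_end = len(record["text"]), 0
    for start, end in record["ai_spans"]:
        assert 0 <= start < end <= n and start >= prev_end
        prev_end = end

validate_spans(record)
print(binary_label(record))  # ai
```

Character-offset spans like these make the interval-detection task strictly more informative than the binary label, since the label is recoverable from the spans but not vice versa.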
Related papers
- Detecting LLM-Generated Text with Performance Guarantees [13.29284903739996]
Large language models (LLMs) such as GPT, Claude, Gemini, and Grok have been deeply integrated into our daily life. They now support a wide range of tasks -- from dialogue and email drafting to assisting with teaching and coding. Their ability to produce highly human-like text raises serious concerns, including the spread of fake news.
arXiv Detail & Related papers (2026-01-10T14:52:45Z)
- A Comprehensive Dataset for Human vs. AI Generated Text Detection [23.0218614564443]
We present a comprehensive dataset comprising over 58,000 text samples drawn from authentic New York Times articles. The dataset provides the original article abstracts as prompts and the full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, and attributing AI texts to their generating models with an accuracy of 8.92%.
arXiv Detail & Related papers (2025-10-26T23:50:52Z)
- mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection [3.562613318511706]
Automated detection can assist humans in identifying machine-generated texts. This notebook describes our mdok approach to robust detection, based on fine-tuning smaller LLMs for text classification. It is applied to both subtasks of Voight-Kampff Generative AI Detection 2025, providing remarkable performance (1st rank) in both.
arXiv Detail & Related papers (2025-06-02T14:07:32Z)
- SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation [55.61004653386632]
Large Language Models (LLMs) often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context. This paper introduces a novel self-supervised method for generating a training set of unfaithful samples. We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones.
arXiv Detail & Related papers (2025-02-19T12:31:58Z)
- GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated text detection by proposing GigaCheck.
Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts.
Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize AI-generated intervals within text.
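Predicted intervals are commonly scored against character-level ground truth via interval IoU with one-to-one matching. The sketch below uses a simple greedy matcher as an illustration (an assumption for exposition; the summary does not specify GigaCheck's actual DETR-style matching criterion):

```python
def interval_iou(a, b):
    """Intersection-over-union of two character intervals [start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def match_intervals(pred, gold, thresh=0.5):
    """Greedily match predicted intervals to gold intervals one-to-one,
    counting a prediction as a true positive when IoU >= thresh."""
    used, tp = set(), 0
    for p in pred:
        best, best_iou = None, 0.0
        for i, g in enumerate(gold):
            iou = interval_iou(p, g)
            if i not in used and iou > best_iou:
                best, best_iou = i, iou
        if best is not None and best_iou >= thresh:
            used.add(best)
            tp += 1
    return tp  # precision = tp / len(pred); recall = tp / len(gold)

print(match_intervals([(10, 50)], [(12, 48)]))  # 1
```

Here the single prediction overlaps its gold interval with IoU 0.9, clearing the 0.5 threshold.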
arXiv Detail & Related papers (2024-10-31T08:30:55Z)
- Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT [9.682499180341273]
Large language models (LLMs) have significantly advanced text generation, but the human-like quality of their outputs presents major challenges. We propose CUDRT, a comprehensive evaluation framework and bilingual benchmark in Chinese and English. This framework supports scalable, reproducible experiments and enables analysis of how operational diversity, multilingual training sets, and LLM architectures influence detection performance.
arXiv Detail & Related papers (2024-06-13T12:43:40Z)
- RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic Features for Distinguishing AI-Generated and Human-Written Texts [0.8437187555622164]
This article investigates the problem of AI-generated text detection from two different aspects: semantics and syntax.
We present an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks.
arXiv Detail & Related papers (2024-02-19T00:40:17Z)
- ToBlend: Token-Level Blending With an Ensemble of LLMs to Attack AI-Generated Text Detection [6.27025292177391]
ToBlend is a novel token-level ensemble text generation method to challenge the robustness of current AI-content detection approaches.
We find that ToBlend significantly degrades the performance of most mainstream AI-content detection methods.
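The token-level blending idea can be caricatured as follows: at each decoding step, a randomly chosen ensemble member emits the next token conditioned on the shared running context, so no single model's statistical fingerprint dominates the text. The `models` callables below are toy stand-ins (an illustrative assumption), not real LLM decoders:

```python
import random

def toblend_style_generate(models, prompt, max_tokens=50, seed=0):
    """Token-level blending sketch: each step, one randomly chosen ensemble
    member emits the next token for the shared context. `models` is a list
    of callables mapping the context string to the next token (or None)."""
    rng = random.Random(seed)
    context = prompt
    for _ in range(max_tokens):
        model = rng.choice(models)
        token = model(context)
        if token is None:  # chosen model signals end of sequence
            break
        context += token
    return context

def make_toy_model(words):
    """Toy 'model' that ignores the context and yields fixed tokens."""
    it = iter(words)
    return lambda ctx: next(it, None)

out = toblend_style_generate(
    [make_toy_model(["alpha ", "beta "]), make_toy_model(["gamma ", "delta "])],
    prompt="seed: ",
)
print(out)
```

The resulting text interleaves tokens from both "models"; a detector calibrated to one model's token statistics sees a mixture it was never trained on.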
arXiv Detail & Related papers (2024-02-17T02:25:57Z)
- Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite these challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating feasibility in real application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
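In heavily simplified form, Dynamic Blocking can be sketched as probabilistically forbidding source surface tokens during decoding, so the model must choose paraphrastic alternatives. The candidate list, scores, and blocking policy below are illustrative assumptions, not the paper's exact algorithm:

```python
import random

def dynamic_blocking_step(candidates, source_tokens, blocked, rng, p_block=0.5):
    """One decoding step of a simplified Dynamic-Blocking-style scheme:
    each source token is blocked with probability p_block, forcing the
    decoder toward surface-form alternatives that encourage paraphrasing.
    `candidates` is a list of (token, score) pairs from the model."""
    for tok in source_tokens:
        if tok not in blocked and rng.random() < p_block:
            blocked.add(tok)
    allowed = [(t, s) for t, s in candidates if t not in blocked]
    if not allowed:  # fall back if every candidate is blocked
        allowed = candidates
    return max(allowed, key=lambda ts: ts[1])[0]

rng = random.Random(42)
blocked = set()
# With p_block=1.0 the source token "quick" is always blocked, so the
# decoder must pick the best-scoring alternative.
step = dynamic_blocking_step(
    [("quick", 0.9), ("fast", 0.6), ("rapid", 0.4)],
    source_tokens=["quick"],
    blocked=blocked,
    rng=rng,
    p_block=1.0,
)
print(step)  # fast
```

Varying the blocked set across decoding runs yields diverse paraphrase candidates from a single model, which is the intuition behind sampling-free paraphrase diversity here.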
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.