Discourse Features Enhance Detection of Document-Level Machine-Generated Content
- URL: http://arxiv.org/abs/2412.12679v2
- Date: Tue, 28 Oct 2025 02:20:41 GMT
- Title: Discourse Features Enhance Detection of Document-Level Machine-Generated Content
- Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller,
- Abstract summary: Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.<n>Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features.<n>We introduce novel methodologies and datasets to overcome these challenges.
- Score: 53.41994768824785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To better capture the structure of longer texts at document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It results in substantial performance gains across both datasets - 15.5% absolute improvement on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvemene on M4 compared to SOTA approaches. The data and code are available at: https://github.com/myxp-lyp/Discourse-Features-Enhance-Detection-of-Document-Level-Machine-Generated -Content.git.
Related papers
- Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding [42.506971197471195]
We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following.<n>Our proposed model significantly outperforms existing state-of-theart MLLMs across a range of visual document understanding benchmarks.
arXiv Detail & Related papers (2025-05-08T17:37:36Z) - TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification [32.958143806547234]
We introduce the Text pROVEnance (TROVE) challenge to trace each sentence of a target text back to specific source sentences.<n>To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios.<n>We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms.
arXiv Detail & Related papers (2025-03-19T15:09:39Z) - VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models [5.713983191152314]
VTechAGP is the first academic-to-general-audience text paraphrase dataset.
We also propose a novel dynamic soft prompt generative language model, DSPT5.
For training, we leverage a contrastive-generative loss function to learn the keyword in the dynamic prompt.
arXiv Detail & Related papers (2024-11-07T16:06:00Z) - QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval [12.225881591629815]
In dense retrieval, embedding long texts into dense vectors can result in information loss, leading to inaccurate query-text matching.
Recent studies mainly focus on improving the sentence embedding model or retrieval process.
We introduce a novel text augmentation framework for dense retrieval, which transforms raw documents into information-dense text formats.
arXiv Detail & Related papers (2024-07-29T17:39:08Z) - Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD)
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles [14.205559299967423]
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended high-quality texts, that are non-trivial to distinguish from human-written texts.
Users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and dis/misinformation at scale.
To mitigate this problem, a computational method to determine if a given text is a deepfake text or not is desired.
We propose TopFormer to improve existing AA solutions by capturing more linguistic patterns in deepfake texts.
arXiv Detail & Related papers (2023-09-22T15:32:49Z) - Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge
Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references for text generation to enhance factuality but often struggle with the knowledge mix-up of irrelevant references.
We present DKGen, which divide the text generation process into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Fine-Grained Scene Graph Generation with Data Transfer [127.17675443137064]
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images.
Recent works have made a steady progress on SGG, and provide useful tools for high-level vision and language understanding.
We propose a novel Internal and External Data Transfer (IETrans) method, which can be applied in a play-and-plug fashion and expanded to large SGG with 1,807 predicate classes.
arXiv Detail & Related papers (2022-03-22T12:26:56Z) - A Benchmark Corpus for the Detection of Automatically Generated Text in
Academic Publications [0.02578242050187029]
This paper presents two datasets comprised of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z) - Open Domain Question Answering over Virtual Documents: A Unified
Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e. open-domain question answering (QA)
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z) - Data Augmentation in Natural Language Processing: A Novel Text
Generation Approach for Long and Short Text Classifiers [8.19984844136462]
We present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts.
In a simulated low data regime additive accuracy gains of up to 15.53% are achieved.
We discuss implications and patterns for the successful application of our approach on different types of datasets.
arXiv Detail & Related papers (2021-03-26T13:16:07Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on TextVQA dataset and two tasks of ST-VQA dataset among all models except pre-training based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.