WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia
- URL: http://arxiv.org/abs/2507.03373v1
- Date: Fri, 04 Jul 2025 08:13:10 GMT
- Title: WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia
- Authors: Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl
- Abstract summary: Existing work primarily evaluates MGT detectors on generic generation tasks. We introduce a multilingual, multi-generator, and task-specific benchmark for MGT detection. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%.
- Score: 2.255682336735152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.
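For intuition about the evaluation protocol, here is a minimal sketch (not the authors' code): a training-based detector is fit on labelled human/machine text and scored by accuracy. `texts` and `labels` are hypothetical stand-ins for one WETBench task split, and the TF-IDF + logistic-regression pipeline is a deliberately weak stand-in for the stronger detectors the paper benchmarks; zero-shot detectors would instead threshold a model statistic (e.g. perplexity) without any fitting.

```python
# Minimal sketch of evaluating a training-based MGT detector
# (illustrative only; not the WETBench code or its detectors).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def evaluate_training_based(texts, labels):
    """texts: raw strings; labels: 1 = machine-generated, 0 = human-written."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)
    detector = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    detector.fit(X_tr, y_tr)  # "training-based": fit on labelled data
    return accuracy_score(y_te, detector.predict(X_te))
```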
Related papers
- GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge [71.69373986176839]
We aim to answer whether models can detect generated text from a large, yet fixed, number of domains and LLMs. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate.
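The headline metric here, detection accuracy at a capped false-positive rate, can be computed as sketched below (illustrative; not the challenge's official scoring script):

```python
# True-positive rate at a fixed false-positive-rate budget (e.g. 5%).
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, target_fpr=0.05):
    """y_true: 1 = machine-generated; scores: higher = more likely machine."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    # Best achievable TPR among thresholds whose FPR stays within budget.
    return tpr[fpr <= target_fpr].max()
```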
arXiv Detail & Related papers (2025-01-15T16:21:09Z)
- On the Generalization and Adaptation Ability of Machine-Generated Text Detectors in Academic Writing [23.434925348283617]
This work investigates the generalization and adaptation capabilities of MGT detectors in three key aspects specific to academic writing. We benchmark the performance of various detectors for binary classification and attribution tasks in both in-domain and cross-domain settings. Our findings provide insights into the generalization and adaptation ability of MGT detectors across diverse scenarios and lay the foundation for building robust, adaptive detection systems.
arXiv Detail & Related papers (2024-12-23T03:30:34Z)
- Beemo: Benchmark of Expert-edited Machine-generated Outputs [5.246065742294272]
This paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo).
arXiv Detail & Related papers (2024-11-06T16:31:28Z)
- GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated-text detection by proposing GigaCheck.
Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts.
Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize AI-generated intervals within text.
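As a simplified illustration of the interval-localization task itself (a plain thresholding baseline, not GigaCheck's DETR-style detector), the sketch below merges consecutive tokens whose hypothetical machine-generation probability exceeds a threshold into spans:

```python
# Baseline interval localization: threshold per-token scores, merge runs.
def spans_from_token_scores(token_scores, threshold=0.5):
    """token_scores: P(machine-generated) per token.
    Returns (start, end) token-index pairs, end exclusive."""
    spans, start = [], None
    for i, p in enumerate(token_scores):
        if p >= threshold and start is None:
            start = i                    # open a span
        elif p < threshold and start is not None:
            spans.append((start, i))     # close the current span
            start = None
    if start is not None:
        spans.append((start, len(token_scores)))
    return spans

# spans_from_token_scores([0.1, 0.8, 0.9, 0.2, 0.7]) -> [(1, 3), (4, 5)]
```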
arXiv Detail & Related papers (2024-10-31T08:30:55Z)
- Machine-Generated Text Localization [16.137882615106523]
Prior work has primarily formulated MGT detection as a binary classification task over an entire document.
This paper provides the first in-depth study of MGT that localizes the portions of a document that were machine generated.
A gain of 4-13% mean Average Precision (mAP) over prior work demonstrates the effectiveness of our approach on five diverse datasets.
arXiv Detail & Related papers (2024-02-19T00:07:28Z)
- M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection [69.41274756177336]
Large Language Models (LLMs) have brought an unprecedented surge in machine-generated text (MGT) across diverse channels.
This raises legitimate concerns about its potential misuse and societal implications.
We introduce a new benchmark based on a multilingual, multi-domain, and multi-generator corpus of MGTs -- M4GT-Bench.
arXiv Detail & Related papers (2024-02-17T02:50:33Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
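For background, the standard hypothesis-testing intuition behind this claim (stated informally and up to constants; not the paper's exact theorem) is that distinguishing the machine-text distribution m from the human-text distribution h using n i.i.d. samples is governed by their total variation distance:

```latex
% Informal background bound, up to constants (not the paper's exact theorem).
% m, h: distributions of machine- and human-written text; n: i.i.d. samples.
\[
  \mathrm{TV}\!\bigl(m^{\otimes n},\, h^{\otimes n}\bigr)
    \;\ge\; 1 - e^{-n\,\mathrm{TV}(m,h)^{2}/2},
\]
% so n = O(TV(m,h)^{-2}) samples suffice for near-perfect distinguishability,
% and as MGT approaches human quality (TV -> 0) the required n diverges.
```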
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning [14.637303913878435]
We present a coherence-based contrastive learning model named CoCo to detect possible MGT under low-resource scenarios.
To exploit linguistic features, we encode coherence information, in the form of a graph, into the text representation.
Experimental results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-12-20T15:26:19Z)
- Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation [92.1582872870226]
We propose a new grounded keys-to-text generation task.
The task is to generate a factual description of an entity, given a set of guiding keys and grounding passages.
Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for factual correctness of generated descriptions.
arXiv Detail & Related papers (2022-12-04T23:59:41Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
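As a reminder of what few-shot means here, the sketch below assembles an in-context prompt from task demonstrations; no weights are updated, and `complete` is a hypothetical stand-in for any text-completion API (not something from the paper):

```python
# Few-shot ("in-context") prompting: demonstrations go in the prompt;
# the model is conditioned on them, never fine-tuned.
def build_few_shot_prompt(instruction, examples, query):
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{instruction}\n\n{demos}\n\nQ: {query}\nA:"

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "book",
)
# answer = complete(prompt)  # hypothetical completion call
```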
arXiv Detail & Related papers (2020-05-28T17:29:03Z)