Beemo: Benchmark of Expert-edited Machine-generated Outputs
- URL: http://arxiv.org/abs/2411.04032v3
- Date: Mon, 17 Mar 2025 12:05:36 GMT
- Title: Beemo: Benchmark of Expert-edited Machine-generated Outputs
- Authors: Ekaterina Artemova, Jason Lucas, Saranya Venkatraman, Jooyoung Lee, Sergei Tilga, Adaku Uchendu, Vladislav Mikhailov
- Abstract summary: This paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo).
- Score: 5.246065742294272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship in various domains. However, most existing MGT benchmarks include single-author texts (human-written and machine-generated). This conventional design fails to capture more practical multi-author scenarios, where the user refines the LLM response for natural flow, coherence, and factual correctness. Our paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for various use cases, ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, allowing for diverse MGT detection evaluation across various edit types. We document Beemo's creation protocol and present the results of benchmarking 33 configurations of MGT detectors in different experimental setups. We find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Beemo and all materials are publicly available.
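To make the benchmarking setup concrete, the following is a minimal sketch of scoring a binary MGT detector separately per text source; the file name, field names, and label values are assumptions for illustration, not Beemo's released format:

```python
import json
from collections import defaultdict

def detect(text: str) -> bool:
    """Stand-in binary detector (True = flagged as machine-generated).
    Replace with any real MGT detector configuration."""
    return len(text.split()) > 200  # trivial length heuristic, illustration only

# Assumed JSONL layout: {"text": ..., "source": ...} with source in
# {"human", "machine", "expert_edited", "llm_edited"}.
flagged = defaultdict(int)
total = defaultdict(int)
with open("beemo.jsonl") as f:
    for line in f:
        record = json.loads(line)
        flagged[record["source"]] += int(detect(record["text"]))
        total[record["source"]] += 1

# Expert-edited texts evading detection would show up here as a low
# flagged rate for "expert_edited" relative to "machine".
for source in total:
    print(f"{source}: {flagged[source] / total[source]:.1%} flagged as MGT")
```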
Related papers
- WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia [2.255682336735152]
Existing work primarily evaluates MGT detectors on generic generation tasks.
We introduce a multilingual, multi-generator, and task-specific benchmark for MGT detection.
We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%.
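For context on the zero-shot family (the weaker of the two detector classes here), a common approach thresholds perplexity under a reference language model. A minimal sketch, with the model choice and threshold as assumptions rather than the benchmark's exact detectors:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Zero-shot heuristic: machine-generated text tends to have lower
# perplexity under a reference language model than human writing.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt", truncation=True).input_ids
    loss = lm(ids, labels=ids).loss  # mean token negative log-likelihood
    return float(torch.exp(loss))

def looks_machine_generated(text: str, threshold: float = 30.0) -> bool:
    # The threshold is an assumption; in practice it is tuned on held-out data.
    return perplexity(text) < threshold
```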
arXiv Detail & Related papers (2025-07-04T08:13:10Z)
- Robust and Fine-Grained Detection of AI Generated Texts [0.29569362468768806]
Existing systems often struggle to accurately identify AI-generated content in shorter texts.
Our paper introduces a set of token classification models trained on an extensive collection of human-machine co-authored texts.
We also introduce a new dataset of over 2.4M such texts, mostly co-authored by several popular proprietary LLMs, across 23 languages.
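A minimal sketch of the token-classification framing described above (the backbone model and two-way label set are assumptions, not the paper's exact setup):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["HUMAN", "MACHINE"]  # assumed per-token tag set
NAME = "xlm-roberta-base"      # multilingual backbone, an assumption

tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForTokenClassification.from_pretrained(
    NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# Each token gets its own HUMAN/MACHINE prediction, so a single document
# can mix authorship, which is the point of the co-authored setting.
enc = tok("Drafted by a human, then polished by an LLM.", return_tensors="pt")
pred = model(**enc).logits.argmax(-1).squeeze(0)  # per-token label ids
```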
arXiv Detail & Related papers (2025-04-16T10:29:30Z)
- GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated text detection by proposing GigaCheck.
Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts.
Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize AI-generated intervals within text.
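GigaCheck's interval head is DETR-like; as a simpler illustration of the output format alone, the sketch below merges per-token labels into character-level intervals using tokenizer offset mappings (function and variable names are hypothetical):

```python
def tokens_to_intervals(offsets, labels, target=1):
    """Merge consecutive tokens labeled `target` (machine-generated)
    into character-level (start, end) intervals.

    offsets: list of (char_start, char_end) per token, e.g. from a
             tokenizer called with return_offsets_mapping=True.
    labels:  per-token predictions, 0 = human, 1 = machine.
    """
    intervals, current = [], None
    for (start, end), lab in zip(offsets, labels):
        if lab == target:
            current = (current[0], end) if current else (start, end)
        elif current:
            intervals.append(current)
            current = None
    if current:
        intervals.append(current)
    return intervals

# Example: tokens 2-4 predicted machine-generated.
print(tokens_to_intervals([(0, 5), (6, 10), (11, 15), (16, 20), (21, 25)],
                          [0, 0, 1, 1, 1]))  # -> [(11, 25)]
```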
arXiv Detail & Related papers (2024-10-31T08:30:55Z)
- RKadiyala at SemEval-2024 Task 8: Black-Box Word-Level Text Boundary Detection in Partially Machine Generated Texts [0.0]
This paper introduces a few reliable approaches for identifying which parts of a given text are machine-generated at the word level.
We present a comparison with proprietary systems and report our model's performance on texts from unseen domains and generators.
The findings reveal significant improvements in detection accuracy, along with comparisons across other aspects of detection capability.
arXiv Detail & Related papers (2024-10-22T03:21:59Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy [47.382793714455445]
Machine-generated texts (MGTs) may carry critical risks, such as plagiarism, misleading information, or hallucination issues.
It is challenging to distinguish MGTs and human-written texts because the distributional discrepancy between them is often very subtle.
We propose a novel multi-population aware optimization method for maximum mean discrepancy (MMD), called MMD-MP.
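For reference, MMD compares two samples via kernel mean embeddings. A minimal NumPy sketch of the standard unbiased squared-MMD estimator with a Gaussian kernel follows (plain MMD, not the paper's multi-population MMD-MP variant):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), computed pairwise
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    n, m = len(x), len(y)
    # Drop diagonal terms for the unbiased within-sample averages.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

# Texts would first be embedded (e.g., by a frozen LM) into vectors; a
# large MMD against a human-written reference sample flags MGT.
rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, (50, 8)), rng.normal(0.5, 1, (50, 8))))
```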
arXiv Detail & Related papers (2024-02-25T09:44:56Z)
- M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection [69.41274756177336]
Large Language Models (LLMs) have brought an unprecedented surge in machine-generated text (MGT) across diverse channels.
This raises legitimate concerns about its potential misuse and societal implications.
We introduce a new benchmark based on a multilingual, multi-domain, and multi-generator corpus of MGTs -- M4GT-Bench.
arXiv Detail & Related papers (2024-02-17T02:50:33Z)
- Deciphering Textual Authenticity: A Generalized Strategy through the Lens of Large Language Semantics for Detecting Human vs. Machine-Generated Text [8.290557547578146]
We introduce a novel system, T5LLMCipher, for detecting machine-generated text using a pretrained T5 encoder combined with LLM embedding sub-clustering.
We find that our approach provides state-of-the-art generalization ability, with an average increase in F1 score on machine-generated text of 19.6% on unseen generators and domains.
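A minimal sketch of the general recipe (mean-pooled T5 encoder embeddings followed by sub-clustering); the pooling, checkpoint, and clustering settings are assumptions, not the paper's exact pipeline:

```python
import torch
from sklearn.cluster import KMeans
from transformers import T5EncoderModel, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

@torch.no_grad()
def embed(texts):
    enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = encoder(**enc).last_hidden_state  # (batch, seq, dim)
    mask = enc.attention_mask.unsqueeze(-1)    # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling

texts = ["A human-written paragraph.", "An LLM-generated paragraph."]
embeddings = embed(texts)
# Sub-cluster the semantic space; cluster assignments (or distances)
# then feed a downstream human-vs-machine classifier.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
```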
arXiv Detail & Related papers (2024-01-17T18:45:13Z)
- LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected? [13.813769457594216]
Current research mainly focuses on purely MGT detection without adequately addressing mixed scenarios.
We define mixtext, a form of mixed text involving both AI and human-generated content.
Our findings reveal that existing detectors struggle to identify mixtext, particularly when it involves subtle modifications and stylistic adaptation.
arXiv Detail & Related papers (2024-01-11T14:44:08Z)
- MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating its feasibility in real application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
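The core argument can be stated as a standard hypothesis-testing heuristic (an informal paraphrase, not the paper's exact theorem):

```latex
% If m and h are the machine and human text distributions and
% \delta = \mathrm{TV}(m, h) is their total variation distance, then
% reliably distinguishing them requires on the order of
\[
  n \;=\; \Omega\!\left(\frac{1}{\delta^{2}}\right)
\]
% i.i.d. samples. As machine text approaches human quality,
% \delta \to 0 and the required sample size diverges.
```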
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- MGTBench: Benchmarking Machine-Generated Text Detection [54.81446366272403]
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs).
We show that a larger number of words generally leads to better performance, and most detection methods can achieve similar performance with far fewer training samples.
Our findings indicate that model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z)