Related papers: The Rise of AI-Generated Content in Wikipedia

URL: http://arxiv.org/abs/2410.08044v1
Date: Thu, 10 Oct 2024 15:36:10 GMT
Title: The Rise of AI-Generated Content in Wikipedia
Authors: Creston Brooks, Samuel Eggert, Denis Peskoff
Abstract summary: We use GPTZero, a proprietary AI detector, and Binoculars, an open-source alternative, to establish lower bounds on the presence of AI-generated content in recently created Wikipedia pages. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated. Flagged Wikipedia articles are typically of lower quality and are often self-promotional or partial towards a specific viewpoint.
Score: 1.3654846342364308
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rise of AI-generated content in popular information sources raises significant concerns about accountability, accuracy, and bias amplification. Beyond directly impacting consumers, the widespread presence of this content poses questions for the long-term viability of training language models on vast internet sweeps. We use GPTZero, a proprietary AI detector, and Binoculars, an open-source alternative, to establish lower bounds on the presence of AI-generated content in recently created Wikipedia pages. Both detectors reveal a marked increase in AI-generated content in recent pages compared to those from before the release of GPT-3.5. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian articles. Flagged Wikipedia articles are typically of lower quality and are often self-promotional or partial towards a specific viewpoint on controversial topics.
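
The calibration step described in the abstract (choose a detector threshold that yields a 1% false positive rate on pre-GPT-3.5 articles, then measure the share of newly created articles flagged) can be illustrated with a minimal sketch. This is not the authors' code: the function names `calibrate_threshold` and `flag_rate`, the convention that a higher score means "more AI-like", and the synthetic Gaussian scores are all assumptions made for illustration. Real detectors such as GPTZero or Binoculars have their own score scales and directions.

```python
import math
import random


def calibrate_threshold(human_scores, target_fpr=0.01):
    """Pick a threshold so that at most target_fpr of known-human texts exceed it.

    Assumption: higher score = more AI-like. Texts with score > threshold are flagged.
    """
    if not human_scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(human_scores)
    n = len(ranked)
    # Number of human-written texts we tolerate being flagged.
    allowed = math.floor(target_fpr * n)
    # Everything strictly above this order statistic is flagged,
    # which is at most `allowed` of the n calibration texts.
    return ranked[n - allowed - 1]


def flag_rate(scores, threshold):
    """Fraction of texts whose score is strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical detector scores for articles created before GPT-3.5.
    pre_gpt35 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    # Hypothetical scores for recent articles: mostly human-like, a few shifted upward.
    recent = [rng.gauss(0.0, 1.0) for _ in range(4750)]
    recent += [rng.gauss(4.0, 1.0) for _ in range(250)]

    threshold = calibrate_threshold(pre_gpt35, target_fpr=0.01)
    print("threshold:", threshold)
    print("flag rate on calibration set:", flag_rate(pre_gpt35, threshold))
    print("flag rate on recent set:", flag_rate(recent, threshold))
```

Under this setup, a flag rate on recent articles that is well above the calibrated 1% suggests the excess comes from AI-generated text, provided the detector behaves on recent human-written text as it did on the calibration set. The abstract's "lower bound" framing is not reproduced in full here; the sketch only shows the threshold-then-count procedure.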

Related papers

Impact of AI Search Summaries on Website Traffic: Evidence from Google AI Overviews and Wikipedia [0.0]
We estimate the causal impact of Google's AI Overview on Wikipedia traffic. Across 161,382 matched article-language pairs, AIO exposure reduces daily traffic to English articles by approximately 15%. These findings provide early causal evidence that generative-answer features in search engines can materially reallocate attention away from informational publishers.
arXiv Detail & Related papers (2026-02-05T01:31:44Z)
AI-Generated Algorithmic Virality [1.8142288667655782]
AI-generated content is said to be highly effective in "gaming the algorithm" and going viral. Popularly referred to as "AI slop," this phenomenon arguably leads to the presence of sloppy and potentially deceptive content. This investigation offers a systematic analysis of AI-generated content and its labelling in TikTok's and Instagram's search results across 13 hashtags.
arXiv Detail & Related papers (2025-08-01T19:41:27Z)
Could AI Trace and Explain the Origins of AI-Generated Images and Text? [53.11173194293537]
AI-generated content is increasingly prevalent in the real world. Adversaries might exploit large multimodal models to create images that violate ethical or legal standards. Paper reviewers may misuse large language models to generate reviews without genuine intellectual effort.
arXiv Detail & Related papers (2025-04-05T20:51:54Z)
Delving into: the quantification of Ai-generated content on the internet (synthetic data) [0.0]
At least 30% of text on active web pages originates from AI-generated sources, with the actual proportion likely approaching 40%. Given the implications of autophagous loops, this is a sobering realization.
arXiv Detail & Related papers (2025-03-29T03:06:53Z)
Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. We systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset. Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z)
REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives [4.427603894929721]
This study proposes a novel approach to enhancing Wikipedia's B and C category biography articles. By utilizing a multi-staged retrieval-augmented generation technique, we aim to enrich the informational content of lesser-known articles.
arXiv Detail & Related papers (2025-02-17T18:53:42Z)
Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection [60.09665704993751]
We introduce FairOPT, an algorithm for group-specific threshold optimization in AI-generated content classifiers. Our approach partitions data into subgroups based on attributes (e.g., text length and writing style) and learns decision thresholds for each group. Our framework paves the way for more robust and fair classification criteria in AI-generated output detection.
arXiv Detail & Related papers (2025-02-06T21:58:48Z)
The Value of AI-Generated Metadata for UGC Platforms: Evidence from a Large-scale Field Experiment [6.8951681566687055]
We conducted a field experiment on a short-video platform in Asia to provide about 1 million users access to AI-generated titles for their uploaded videos. Our findings show that the provision of AI-generated titles significantly boosted content consumption, increasing valid watches by 1.6% and watch duration by 0.9%. This viewership-boost effect was largely attributed to the use of this generative AI (GAI) tool increasing the likelihood of videos having a title by 41.4%.
arXiv Detail & Related papers (2024-12-24T10:47:27Z)
Suspected Undeclared Use of Artificial Intelligence in the Academic Literature: An Analysis of the Academ-AI Dataset [0.0]
Academ-AI documents examples of suspected undeclared AI usage in the academic literature. Undeclared AI seems to appear in journals with higher citation metrics and higher article processing charges.
arXiv Detail & Related papers (2024-11-20T21:29:36Z)
Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation [48.70176791365903]
This study explores how bias shapes the perception of AI versus human generated content. We investigated how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z)
Disclosure of AI-Generated News Increases Engagement but Does Not Reduce Aversion, Despite Positive Quality Ratings [3.036383058306671]
The integration of AI in journalism presents both opportunities and risks for democracy. This study investigates the perceived quality of AI-assisted and AI-generated versus human-generated news articles.
arXiv Detail & Related papers (2024-09-05T13:12:16Z)
Lying Blindly: Bypassing ChatGPT's Safeguards to Generate Hard-to-Detect Disinformation Claims [0.7478304180168185]
This study explores the capability of ChatGPT with GPT-3.5 to generate short-form disinformation claims about the war in Ukraine. We do not provide the model with human-written disinformation narratives by including them in the prompt. We show that ChatGPT can produce realistic, target-specific disinformation claims, even on a specific post-cutoff event.
arXiv Detail & Related papers (2024-02-13T13:50:08Z)
Orphan Articles: The Dark Matter of Wikipedia [13.290424502717734]
We conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles. We find that a surprisingly large extent of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia. We also provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase of their visibility.
arXiv Detail & Related papers (2023-06-06T18:04:33Z)
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking. We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need? [112.12974778019304]
Generative AI (AIGC, a.k.a. AI-generated content) has made headlines everywhere because of its ability to analyze and create text, images, and beyond. In the era of AI transitioning from pure analysis to creation, it is worth noting that ChatGPT, with its most recent language model GPT-4, is just a tool out of numerous AIGC tasks. This work focuses on the technological development of various AIGC tasks based on their output type, including text, images, videos, 3D content, etc.
arXiv Detail & Related papers (2023-03-21T10:09:47Z)
Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims. Our first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims. Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation. We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia. This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention. We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engine query logs.
arXiv Detail & Related papers (2020-05-02T04:00:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.