Delving into: the quantification of Ai-generated content on the internet (synthetic data)
- URL: http://arxiv.org/abs/2504.08755v1
- Date: Sat, 29 Mar 2025 03:06:53 GMT
- Title: Delving into: the quantification of Ai-generated content on the internet (synthetic data)
- Authors: Dirk HR Spennemann
- Abstract summary: At least 30% of text on active web pages originates from AI-generated sources, with the actual proportion likely approaching 40%. Given the implications of autophagous loops, this is a sobering realization.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While it is increasingly evident that the internet is becoming saturated with content created by generative AI large language models, accurately measuring the scale of this phenomenon has proven challenging. By analyzing the frequency of specific keywords commonly used by ChatGPT, this paper demonstrates that such linguistic markers can effectively be used to estimate the presence of generative AI content online. The findings suggest that at least 30% of text on active web pages originates from AI-generated sources, with the actual proportion likely approaching 40%. Given the implications of autophagous loops, this is a sobering realization.
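The keyword-frequency idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the marker list `MARKERS`, the function names, and the two-component mixture formula are hypothetical and are not taken from the paper, whose actual keyword list and estimation procedure are not given in this abstract.

```python
import re
from collections import Counter

# Hypothetical marker words chosen for illustration only;
# the paper's actual keyword list is not given in the abstract.
MARKERS = {"delve", "delves", "delving", "tapestry"}


def marker_rate(text, markers=MARKERS):
    """Return the number of marker-word occurrences per 1,000 word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in markers)
    return 1000.0 * hits / len(tokens)


def estimate_ai_share(observed_rate, human_rate, ai_rate):
    """Estimate the AI-generated share p under an assumed mixture model.

    Assumes observed_rate = (1 - p) * human_rate + p * ai_rate and solves
    for p, clipping the result to the interval [0, 1].
    """
    if ai_rate <= human_rate:
        raise ValueError("ai_rate must exceed human_rate")
    share = (observed_rate - human_rate) / (ai_rate - human_rate)
    return min(1.0, max(0.0, share))
```

In this sketch, `human_rate` would come from a corpus known to predate generative AI and `ai_rate` from text known to be model-generated; both reference rates are inputs the caller has to supply, and the estimate is only as good as those baselines.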
Related papers
- Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images [96.43608872116347]
AnomReason is a large-scale benchmark with structured annotations as quadruples.
AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images.
arXiv Detail & Related papers (2025-10-11T14:09:24Z)
- AI-Generated Algorithmic Virality [1.8142288667655782]
AI-generated content is said to be highly effective in "gaming the algorithm" and going viral.
Popularly referred to as "AI slop," this phenomenon arguably leads to the presence of sloppy and potentially deceptive content.
This investigation offers a systematic analysis of AI-generated content and its labelling in TikTok's and Instagram's search results across 13 hashtags.
arXiv Detail & Related papers (2025-08-01T19:41:27Z)
- The power of text similarity in identifying AI-LLM paraphrased documents: The case of BBC news articles and ChatGPT [2.024925013349319]
We demonstrate the ability of pattern-based similarity detection for AI paraphrased news recognition.
We propose an algorithmic scheme that not only detects whether an article is an AI paraphrase but, more importantly, identifies ChatGPT as the source of infringement.
Results show that our pattern similarity-based method, which makes no use of deep learning, detects ChatGPT-assisted paraphrased articles with 96.23% accuracy, 96.25% precision, 96.21% sensitivity, 96.25% specificity, and a 96.23% F1 score.
arXiv Detail & Related papers (2025-05-18T13:16:30Z)
- Could AI Trace and Explain the Origins of AI-Generated Images and Text? [53.11173194293537]
AI-generated content is increasingly prevalent in the real world.
Adversaries might exploit large multimodal models to create images that violate ethical or legal standards.
Paper reviewers may misuse large language models to generate reviews without genuine intellectual effort.
arXiv Detail & Related papers (2025-04-05T20:51:54Z)
- Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content.
We systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset.
Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z)
- Generative AI in Academic Writing: A Comparison of DeepSeek, Qwen, ChatGPT, Gemini, Llama, Mistral, and Gemma [0.9562145896371785]
Alibaba released its AI model, Qwen 2.5 Max, on January 29, 2025.
This study aims to evaluate the academic writing performance of both Qwen 2.5 Max and DeepSeek v3.
arXiv Detail & Related papers (2025-02-11T18:33:22Z)
- EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems [103.91826112815384]
Citation-based QA systems suffer from two shortcomings.
They usually rely only on the web as a source of extracted knowledge, and adding other external knowledge sources can hamper the efficiency of the system.
We propose our enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system.
arXiv Detail & Related papers (2024-06-14T19:40:38Z)
- MUGC: Machine Generated versus User Generated Content Detection [1.6602942962521352]
We show that traditional methods demonstrate a high level of accuracy in identifying machine-generated data.
Machine-generated texts tend to be shorter and exhibit less word variety compared to human-generated content.
Readability, bias, moral, and affect comparisons reveal a discernible contrast between machine-generated and human-generated content.
arXiv Detail & Related papers (2024-03-28T07:33:53Z)
- Deep Learning Detection Method for Large Language Models-Generated Scientific Content [0.0]
Large Language Models generate scientific content that is indistinguishable from that written by humans.
This research paper presents a novel ChatGPT-generated scientific text detection method, AI-Catcher.
On average, AI-Catcher improved accuracy by 37.4%.
arXiv Detail & Related papers (2024-02-27T19:16:39Z)
- AI Content Self-Detection for Transformer-based Large Language Models [0.0]
This paper introduces the idea of direct origin detection and evaluates whether generative AI systems can recognize their output and distinguish it from human-written texts.
Google's Bard model exhibits the largest capability of self-detection with an accuracy of 94%, followed by OpenAI's ChatGPT with 83%.
arXiv Detail & Related papers (2023-12-28T10:08:57Z)
- The Curse of Recursion: Training on Generated Data Makes Models Forget [70.02793975243212]
Large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images.
We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear.
arXiv Detail & Related papers (2023-05-27T15:10:41Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
- Is This Abstract Generated by AI? A Research for the Gap between AI-generated Scientific Text and Human-written Scientific Text [13.438933219811188]
We investigate the gap between scientific content generated by AI and written by humans.
We find that there exists a "writing style" gap between AI-generated scientific text and human-written scientific text.
arXiv Detail & Related papers (2023-01-24T04:23:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.