Related papers: Consent in Crisis: The Rapid Decline of the AI Data Commons

Consent in Crisis: The Rapid Decline of the AI Data Commons

URL: http://arxiv.org/abs/2407.14933v2
Date: Wed, 24 Jul 2024 16:52:51 GMT
Title: Consent in Crisis: The Rapid Decline of the AI Data Commons
Authors: Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland,
Abstract summary: General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data. We conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora.
Score: 74.68176012363253
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.

Related papers

Could AI Trace and Explain the Origins of AI-Generated Images and Text? [53.11173194293537]
AI-generated content is increasingly prevalent in the real world. adversaries might exploit large multimodal models to create images that violate ethical or legal standards. Paper reviewers may misuse large language models to generate reviews without genuine intellectual effort.
arXiv Detail & Related papers (2025-04-05T20:51:54Z)
Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. We systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset. Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z)
Fundamental Risks in the Current Deployment of General-Purpose AI Models: What Have We (Not) Learnt From Cybersecurity? [60.629883024152576]
Large Language Models (LLMs) have seen rapid deployment in a wide range of use cases. OpenAIs Altera are just a few examples of increased autonomy, data access, and execution capabilities. These methods come with a range of cybersecurity challenges.
arXiv Detail & Related papers (2024-12-19T14:44:41Z)
A Survey on Offensive AI Within Cybersecurity [1.8206461789819075]
This survey paper on offensive AI will comprehensively cover various aspects related to attacks against and using AI systems. It will delve into the impact of offensive AI practices on different domains, including consumer, enterprise, and public digital infrastructure. The paper will explore adversarial machine learning, attacks against AI models, infrastructure, and interfaces, along with offensive techniques like information gathering, social engineering, and weaponized AI.
arXiv Detail & Related papers (2024-09-26T17:36:22Z)
Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools [32.78336381381673]
We report on the first preregistered empirical evaluation of AI-driven legal research tools. We find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. It provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs.
arXiv Detail & Related papers (2024-05-30T17:56:05Z)
Artificial Intelligence Index Report 2024 [15.531650534547945]
The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI) The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on AI. This year's edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.
arXiv Detail & Related papers (2024-05-29T20:59:57Z)
Fairness in AI and Its Long-Term Implications on Society [68.8204255655161]
We take a closer look at AI fairness and analyze how lack of AI fairness can lead to deepening of biases over time. We discuss how biased models can lead to more negative real-world outcomes for certain groups. If the issues persist, they could be reinforced by interactions with other risks and have severe implications on society in the form of social unrest.
arXiv Detail & Related papers (2023-04-16T11:22:59Z)
Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data. We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more. We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)
Proceedings of the Artificial Intelligence for Cyber Security (AICS) Workshop at AAAI 2022 [55.573187938617636]
The workshop will focus on the application of AI to problems in cyber security. Cyber systems generate large volumes of data, utilizing this effectively is beyond human capabilities.
arXiv Detail & Related papers (2022-02-28T18:27:41Z)
Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations. It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
Structured access to AI capabilities: an emerging paradigm for safe AI deployment [0.0]
Instead of openly disseminating AI systems, developers facilitate controlled, arm's length interactions with their AI systems. Aim is to prevent dangerous AI capabilities from being widely accessible, whilst preserving access to AI capabilities that can be used safely.
arXiv Detail & Related papers (2022-01-13T19:30:16Z)
Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being. For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.