Data Governance in the Age of Large-Scale Data-Driven Language Technology
- URL: http://arxiv.org/abs/2206.03216v2
- Date: Wed, 2 Nov 2022 21:18:30 GMT
- Title: Data Governance in the Age of Large-Scale Data-Driven Language Technology
- Authors: Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim
Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant
Subramani, Gérard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Isaac
Johnson, Dragomir Radev, Somaieh Nikpoor, Jörg Frohberg, Aaron Gokaslan,
Peter Henderson, Rishi Bommasani, Margaret Mitchell
- Abstract summary: This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights.
The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
- Score: 79.92626780294258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent emergence and adoption of Machine Learning technology, and
specifically of Large Language Models, has drawn attention to the need for
systematic and transparent management of language data. This work proposes an
approach to global language data governance that attempts to organize data
management amongst stakeholders, values, and rights. Our proposal is informed
by prior work on distributed governance that accounts for human values and
grounded by an international research collaboration that brings together
researchers and practitioners from 60 countries. The framework we present is a
multi-party international governance structure focused on language data, and
incorporating technical and organizational tools needed to support its work.
Related papers
- Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
This survey provides a comprehensive overview of the current research on low-resource language misinformation detection.
We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts.
Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
arXiv Detail & Related papers (2024-10-24T03:02:03Z)
- Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z)
- Empowering Domain-Specific Language Models with Graph-Oriented Databases: A Paradigm Shift in Performance and Model Maintenance [0.0]
Our work is driven by the need to manage and process large volumes of short text documents inherent in specific application domains.
By leveraging domain-specific knowledge and expertise, our approach aims to shape factual data within these domains.
Our work underscores the transformative potential of the partnership of domain-specific language models and graph-oriented databases.
arXiv Detail & Related papers (2024-10-04T19:02:09Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Layers of technology in pluriversal design. Decolonising language technology with the LiveLanguage initiative [9.063726739562227]
This paper uses LiveLanguage, a lexical database, as an example to discuss and close the gap from pluriversal design theory to practice.
The paper presents a model comprising five layers of technological activity.
arXiv Detail & Related papers (2024-05-02T23:52:39Z)
- A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems [68.76102493999134]
We take stock of and empirically analyse task performance disparities that exist between multilingual task-oriented dialogue systems.
We prove the existence of the adaptation and intrinsic biases in current ToD systems.
Our analyses offer practical tips on how to approach ToD data collection and system development for new languages.
arXiv Detail & Related papers (2023-10-19T16:41:44Z)
- Mapping and Comparing Data Governance Frameworks: A benchmarking exercise to inform global data governance deliberations [0.0]
This article explores the increasing importance of global data governance due to the rapid growth of data and the need for responsible data use and protection.
The report highlights the need for a more holistic, coordinated approach to data governance to manage the global flow of data responsibly and for the public interest.
arXiv Detail & Related papers (2023-02-27T12:56:25Z)
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
- GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z)
- Systematic Inequalities in Language Technology Performance across the World's Languages [94.65681336393425]
We introduce a framework for estimating the global utility of language technologies.
Our analyses involve the field at large, but also more in-depth studies on both user-facing technologies and more linguistic NLP tasks.
arXiv Detail & Related papers (2021-10-13T14:03:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.