Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power?
- URL: http://arxiv.org/abs/2109.08131v1
- Date: Thu, 16 Sep 2021 17:38:26 GMT
- Title: Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power?
- Authors: Milagros Miceli, Julian Posada, Tianling Yang
- Abstract summary: We argue that reducing societal problems to "bias" misses the context-based nature of data.
We highlight the corporate forces and market imperatives involved in the labor of data workers that subsequently shape ML datasets.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Research in machine learning (ML) has primarily argued that models trained on
incomplete or biased datasets can lead to discriminatory outputs. In this
commentary, we propose moving the research focus beyond bias-oriented framings
by adopting a power-aware perspective to "study up" ML datasets. This means
accounting for historical inequities, labor conditions, and epistemological
standpoints inscribed in data. We draw on HCI and CSCW work to support our
argument, critically analyze previous research, and point at two co-existing
lines of work within our community -- one bias-oriented, the other power-aware.
This way, we highlight the need for dialogue and cooperation in three areas:
data quality, data work, and data documentation. In the first area, we argue
that reducing societal problems to "bias" misses the context-based nature of
data. In the second one, we highlight the corporate forces and market
imperatives involved in the labor of data workers that subsequently shape ML
datasets. Finally, we propose expanding current transparency-oriented efforts
in dataset documentation to reflect the social contexts of data design and
production.
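As a rough illustration of what such expanded, context-aware dataset documentation could record in practice, the sketch below extends datasheet-style metadata with fields for labor conditions and data provenance. This is a minimal, hypothetical example; all field names and values are assumptions for illustration and are not prescribed by the paper.

```python
# Hypothetical sketch: a datasheet-style documentation record extended with
# the social context of data design and production discussed in the abstract.
# Field names are illustrative only, not an API defined by the paper.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataWorkRecord:
    """Who produced the annotations and under which conditions."""
    annotator_pool: str        # e.g. "outsourced BPO workers", "in-house team"
    employer_or_platform: str  # company or crowdsourcing platform involved
    pay_structure: str         # e.g. "per-task piece rate", "hourly wage"
    instructions_author: str   # who wrote the labeling instructions
    quality_control: str       # how disagreements and rejections were handled


@dataclass
class SocialContextDatasheet:
    """Datasheet-style documentation extended with power-aware fields."""
    dataset_name: str
    commissioning_organization: str     # who requested and funded the data
    intended_use: str
    collection_sources: List[str]
    populations_represented: List[str]  # whose data, from which regions
    populations_missing: List[str]      # known gaps and exclusions
    data_work: DataWorkRecord
    known_historical_inequities: List[str] = field(default_factory=list)


# Example usage (all values are invented for illustration only):
doc = SocialContextDatasheet(
    dataset_name="example-image-corpus",
    commissioning_organization="Example Corp research lab",
    intended_use="object recognition benchmarking",
    collection_sources=["public web scrape"],
    populations_represented=["urban scenes, mostly North America and Europe"],
    populations_missing=["rural regions, Global South imagery"],
    data_work=DataWorkRecord(
        annotator_pool="outsourced annotation vendor",
        employer_or_platform="third-party BPO",
        pay_structure="per-image piece rate",
        instructions_author="client ML team",
        quality_control="majority vote with rejected tasks unpaid",
    ),
)
print(doc.dataset_name, "-", len(doc.populations_missing), "documented gap(s)")
```

Such a record would sit alongside, not replace, existing transparency artifacts; the point is that provenance and labor conditions become first-class documentation fields rather than footnotes.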
Related papers
- Fairness and Bias Mitigation in Computer Vision: A Survey [61.01658257223365]
Computer vision systems are increasingly being deployed in high-stakes real-world applications.
There is a dire need to ensure that they do not propagate or amplify any discriminatory tendencies in historical or human-curated data.
This paper presents a comprehensive survey on fairness that summarizes and sheds light on ongoing trends and successes in the context of computer vision.
arXiv Detail & Related papers (2024-08-05T13:44:22Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Understanding the Dataset Practitioners Behind Large Language Model Development [5.48392160519422]
We define the role of "dataset practitioners" at a technology company, Google.
We conduct semi-structured interviews with a cross-section of these practitioners.
We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it.
arXiv Detail & Related papers (2024-02-21T23:50:37Z) - Data-Augmented and Retrieval-Augmented Context Enrichment in Chinese Media Bias Detection [16.343223974292908]
We build a dataset of Chinese news reports about COVID-19, annotated by our newly designed system.
In Data-Augmented Context Enrichment (DACE), we enlarge the training data; while in Retrieval-Augmented Context Enrichment (RACE), we improve information retrieval methods to select valuable information.
Our results show that both methods outperform our baselines, while the RACE methods are more efficient and have more potential.
arXiv Detail & Related papers (2023-11-02T16:29:49Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Bringing the People Back In: Contesting Benchmark Machine Learning Datasets [11.00769651520502]
We outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created.
We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets.
arXiv Detail & Related papers (2020-07-14T23:22:13Z) - REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset.
It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z)
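To make the dimension-wise idea from the REVISE entry concrete, the sketch below shows one generic way distribution skews could be surfaced from dataset metadata. It is a hypothetical illustration under assumed metadata fields; it does not use the actual REVISE API or its statistics.

```python
# Hypothetical sketch of surfacing dataset-level distribution skews along
# dimensions such as object, person, and geography. This is NOT the REVISE
# API; it only illustrates the general idea on assumed metadata fields.
from collections import Counter
from typing import Dict, Iterable, List


def dimension_counts(records: Iterable[Dict[str, str]], key: str) -> Counter:
    """Count how often each value of a metadata field appears."""
    return Counter(r[key] for r in records if key in r and r[key])


def flag_skew(counts: Counter, threshold: float = 0.5) -> List[str]:
    """Flag values that dominate more than `threshold` of the records."""
    total = sum(counts.values())
    return [v for v, c in counts.items() if total and c / total > threshold]


# Toy records with assumed metadata fields; a real dataset would have many more.
records = [
    {"object": "car", "person_gender": "male", "country": "USA"},
    {"object": "car", "person_gender": "male", "country": "USA"},
    {"object": "dog", "person_gender": "female", "country": "USA"},
    {"object": "car", "person_gender": "male", "country": "India"},
]

for dim in ("object", "person_gender", "country"):
    counts = dimension_counts(records, dim)
    print(dim, dict(counts), "dominant:", flag_skew(counts))
```

Note that counting alone only reveals distributional imbalance; interpreting whether a skew reflects a harmful bias still requires the contextual, power-aware analysis the main paper argues for.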