Data Cards: Purposeful and Transparent Dataset Documentation for
Responsible AI
- URL: http://arxiv.org/abs/2204.01075v1
- Date: Sun, 3 Apr 2022 13:49:36 GMT
- Title: Data Cards: Purposeful and Transparent Dataset Documentation for
Responsible AI
- Authors: Mahima Pushkarna (1), Andrew Zaldivar (1), Oddur Kjartansson (1) ((1)
Google Research)
- Abstract summary: We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As research and industry move towards large-scale models capable of numerous
downstream tasks, the complexity of understanding multi-modal datasets that
give nuance to models rapidly increases. A clear and thorough understanding of
a dataset's origins, development, intent, ethical considerations and evolution
becomes a necessary step for the responsible and informed deployment of models,
especially those in people-facing contexts and high-risk domains. However, the
burden of this understanding often falls on the intelligibility, conciseness,
and comprehensiveness of the documentation. It requires consistency and
comparability across the documentation of all datasets involved, and as such
documentation must be treated as a user-centric product in and of itself. In
this paper, we propose Data Cards for fostering transparent, purposeful and
human-centered documentation of datasets within the practical contexts of
industry and research. Data Cards are structured summaries of essential facts
about various aspects of ML datasets needed by stakeholders across a dataset's
lifecycle for responsible AI development. These summaries provide explanations
of processes and rationales that shape the data and consequently the models,
such as upstream sources; data collection and annotation methods; training and
evaluation methods; intended use; and decisions affecting model performance. We
also present frameworks that ground Data Cards in real-world utility and
human-centricity. Using two case studies, we report on desirable
characteristics that support adoption across domains, organizational
structures, and audience groups. Finally, we present lessons learned from
deploying over 20 Data Cards.
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We outline the tasks involved in dataset development and offer insights into their effective management.
We then detail the implementation process, which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z)
- Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset cards on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation [8.447159556925182]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers, the first being who the annotator is, and how the annotators' lived experiences can impact their annotations.
We introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline.
arXiv Detail & Related papers (2022-06-09T23:31:17Z)
- Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata [10.689661834716613]
Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners.
arXiv Detail & Related papers (2022-06-06T21:55:39Z)
- CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
- Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure [9.825840279544465]
Datasets that empower machine learning are often used, shared, and re-used with little visibility into the processes of deliberation that led to their creation.
This paper introduces a rigorous framework for dataset development transparency which supports decision-making and accountability.
arXiv Detail & Related papers (2020-10-23T01:57:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.