Do Datasets Have Politics? Disciplinary Values in Computer Vision
Dataset Development
- URL: http://arxiv.org/abs/2108.04308v1
- Date: Mon, 9 Aug 2021 19:07:58 GMT
- Title: Do Datasets Have Politics? Disciplinary Values in Computer Vision
Dataset Development
- Authors: Morgan Klaus Scheuerman, Emily Denton, Alex Hanna
- Abstract summary: We collect a corpus of about 500 computer vision datasets, from which we sampled 114 dataset publications across different vision tasks.
We discuss how computer vision datasets authors value efficiency at the expense of care; universality at the expense of contextuality; and model work at the expense of data work.
We conclude with suggestions on how to better incorporate silenced values into the dataset creation and curation process.
- Score: 6.182409582844314
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Data is a crucial component of machine learning. The field is reliant on data
to train, validate, and test models. With increased technical capabilities,
machine learning research has boomed in both academic and industry settings,
and one major focus has been on computer vision. Computer vision is a popular
domain of machine learning increasingly pertinent to real-world applications,
from facial recognition in policing to object detection for autonomous
vehicles. Given computer vision's propensity to shape machine learning research
and impact human life, we seek to understand disciplinary practices around
dataset documentation - how data is collected, curated, annotated, and packaged
into datasets for computer vision researchers and practitioners to use for
model tuning and development. Specifically, we examine what dataset
documentation communicates about the underlying values of vision data and the
larger practices and goals of computer vision as a field. To conduct this
study, we collected a corpus of about 500 computer vision datasets, from which
we sampled 114 dataset publications across different vision tasks. Through both
a structured and thematic content analysis, we document a number of values
around accepted data practices, what makes desirable data, and the treatment of
humans in the dataset construction process. We discuss how computer vision
datasets authors value efficiency at the expense of care; universality at the
expense of contextuality; impartiality at the expense of positionality; and
model work at the expense of data work. Many of the silenced values we identify
sit in opposition with social computing practices. We conclude with suggestions
on how to better incorporate silenced values into the dataset creation and
curation process.
Related papers
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [57.40024206484446]
We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models.
BVS supports a large number of adjustable parameters at the scene level.
We showcase three example application scenarios.
arXiv Detail & Related papers (2024-05-15T17:57:56Z) - Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework [1.5993707490601146]
We evaluate data practices in machine learning as data curation practices.
We find that researchers in machine learning, which often emphasizes model development, struggle to apply standard data curation principles.
arXiv Detail & Related papers (2024-05-04T16:21:05Z) - SeeBel: Seeing is Believing [0.9790236766474201]
We propose three visualizations that enable users to compare dataset statistics and AI performance for segmenting all images.
Our project tries to further increase the interpretability of the trained AI model for segmentation by visualizing its image attention weights.
We propose to conduct surveys on real users to study the efficacy of our visualization tool in computer vision and AI domain.
arXiv Detail & Related papers (2023-12-18T05:11:00Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - Privacy-Preserving Graph Machine Learning from Data to Computation: A
Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z) - A Survey on RGB-D Datasets [69.73803123972297]
This paper reviewed and categorized image datasets that include depth information.
We gathered 203 datasets that contain accessible data and grouped them into three categories: scene/objects, body, and medical.
arXiv Detail & Related papers (2022-01-15T05:35:19Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic
Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z) - On The State of Data In Computer Vision: Human Annotations Remain
Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML)
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than Imagenet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z) - REGRAD: A Large-Scale Relational Grasp Dataset for Safe and
Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named regrad to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models for the generation of as many data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z) - Shuffler: A Large Scale Data Management Tool for ML in Computer Vision [0.0]
We present Shuffler, an open source tool that makes it easy to manage large computer vision datasets.
Shuffler defines over 40 data handling operations with annotations that are commonly useful in supervised learning applied to computer vision.
arXiv Detail & Related papers (2021-04-11T22:27:28Z) - Data Vision: Learning to See Through Algorithmic Abstraction [6.730787776951012]
Learning to see through data is central to contemporary forms of algorithmic knowledge production.
This paper examines how the often-divergent demands of mechanization and discretion manifest in data analytic learning environments.
arXiv Detail & Related papers (2020-02-09T15:46:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.