On The State of Data In Computer Vision: Human Annotations Remain
Indispensable for Developing Deep Learning Models
- URL: http://arxiv.org/abs/2108.00114v1
- Date: Sat, 31 Jul 2021 00:08:21 GMT
- Title: On The State of Data In Computer Vision: Human Annotations Remain
Indispensable for Developing Deep Learning Models
- Authors: Zeyad Emam, Andrew Kondrich, Sasha Harrison, Felix Lau, Yushi Wang,
Aerin Kim, Elliot Branson
- Abstract summary: High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML) and, in particular, deep learning (DL).
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than ImageNet.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality labeled datasets play a crucial role in fueling the development
of machine learning (ML), and in particular the development of deep learning
(DL). However, since the emergence of the ImageNet dataset and the AlexNet
model in 2012, the size of new open-source labeled vision datasets has remained
roughly constant. Consequently, only a minority of publications in the computer
vision community tackle supervised learning on datasets that are orders of
magnitude larger than ImageNet. In this paper, we survey computer vision
research domains that study the effects of such large datasets on model
performance across different vision tasks. We summarize the community's current
understanding of those effects, and highlight some open questions related to
training with massive datasets. In particular, we tackle: (a) The largest
datasets currently used in computer vision research and the interesting
takeaways from training on such datasets; (b) The effectiveness of pre-training
on large datasets; (c) Recent advancements and hurdles facing synthetic
datasets; (d) An overview of double descent and sample non-monotonicity
phenomena; and finally, (e) A brief discussion of lifelong/continual learning
and how it fares compared to learning from huge labeled datasets in an offline
setting. Overall, our findings are that research on optimization for deep
learning focuses on perfecting the training routine and thus making DL models
less data hungry, while research on synthetic datasets aims to offset the cost
of data labeling. However, for the time being, acquiring non-synthetic labeled
data remains indispensable to boost performance.
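The pre-training question in point (b) can be illustrated with a deliberately tiny toy model. This is a minimal sketch, not drawn from the paper: the one-parameter model, the datasets, and the `sgd`/`mse` helpers below are all hypothetical, chosen only to show the workflow of pre-training on a large related dataset and then fine-tuning on a small labeled target set.

```python
# Hypothetical sketch of pre-train-then-fine-tune on a one-parameter
# linear model y = w * x, fitted with plain SGD on squared error.
import random

random.seed(0)

def sgd(w, data, lr=0.05, epochs=200):
    """Stochastic gradient descent on squared error for y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

def mse(w, data):
    """Mean squared error of y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Large pre-training set: 100 noisy samples of a *related* task, y = 2x.
pretrain = [(i / 100, 2 * (i / 100) + random.gauss(0, 0.1)) for i in range(100)]
# Tiny labeled target set for the actual task, y = 2.5x.
target = [(0.2, 0.5), (0.5, 1.25), (0.9, 2.25)]
test = [(0.3, 0.75), (0.7, 1.75)]

w_scratch = sgd(0.0, target)                  # small labeled data only
w_finetune = sgd(sgd(0.0, pretrain), target)  # pre-train, then fine-tune

print(w_scratch, w_finetune, mse(w_finetune, test))
```

On a toy problem this simple both routes converge to the same solution; the point is only the mechanics of initializing fine-tuning from a pre-trained weight rather than from scratch, which is where the surveyed papers measure the benefit of large pre-training datasets.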
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- SynDrone -- Multi-modal UAV Dataset for Urban Scenarios [11.338399194998933]
The scarcity of large-scale real datasets with pixel-level annotations poses a significant challenge to researchers.
We propose a multimodal synthetic dataset containing both images and 3D data taken at multiple flying heights.
The dataset will be made publicly available to support the development of novel computer vision methods targeting UAV applications.
arXiv Detail & Related papers (2023-08-21T06:22:10Z)
- LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which comprises 8,600 traffic sensors in California with five years of time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z)
- A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success with the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
- WorldGen: A Large Scale Generative Simulator [12.886022807173337]
We present WorldGen, an open source framework to autonomously generate countless structured and unstructured 3D photorealistic scenes.
WorldGen gives the user full access and control to features such as texture, object structure, motion, camera and lens properties for better generalizability.
arXiv Detail & Related papers (2022-10-03T05:07:42Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- A Proposal to Study "Is High Quality Data All We Need?" [8.122270502556374]
We propose an empirical study that examines how to select a subset of, and/or create, high-quality benchmark data.
We seek to answer if big datasets are truly needed to learn a task, and whether a smaller subset of high quality data can replace big datasets.
arXiv Detail & Related papers (2022-03-12T10:50:13Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
- Deflating Dataset Bias Using Synthetic Data Augmentation [8.509201763744246]
State-of-the-art methods for most vision tasks for Autonomous Vehicles (AVs) rely on supervised learning.
The goal of this paper is to investigate the use of targeted synthetic data augmentation for filling gaps in real datasets for vision tasks.
Empirical studies on three different computer vision tasks of practical use to AVs consistently show that having synthetic data in the training mix provides a significant boost in cross-dataset generalization performance.
arXiv Detail & Related papers (2020-04-28T21:56:10Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
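The targeted synthetic augmentation idea from the Deflating Dataset Bias entry above can be sketched with a toy regression problem. Everything here is a hypothetical illustration, not from any of the listed papers: real labeled data covers only part of the input range, and noise-free "synthetic" samples (standing in for rendered or simulated data) fill the coverage gap before fitting.

```python
# Hypothetical sketch of targeted synthetic augmentation: real data has a
# coverage gap, synthetic samples fill it, and we compare held-out error
# in the gap region with and without the synthetic data in the training mix.
import random

random.seed(1)

def target_fn(x):
    return 2.0 * x + 1.0  # unknown ground truth we are trying to learn

# Real labeled data covers only x in [0, 0.5] (a coverage gap above 0.5).
real = [(x, target_fn(x) + random.gauss(0, 0.05))
        for x in [i / 20 for i in range(11)]]
# Synthetic samples (e.g. from a renderer) fill x in (0.5, 1.0].
synthetic = [(x, target_fn(x)) for x in [0.55 + i / 20 for i in range(10)]]

def fit_line(data):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    a = (sum((x - mx) * (y - my) for x, y in data)
         / sum((x - mx) ** 2 for x, _ in data))
    return a, my - a * mx

def holdout_err(model, xs):
    """Mean squared error against ground truth on held-out inputs."""
    a, b = model
    return sum((a * x + b - target_fn(x)) ** 2 for x in xs) / len(xs)

holdout = [0.6, 0.7, 0.8, 0.9]  # the region the real data never covered
err_real = holdout_err(fit_line(real), holdout)
err_mix = holdout_err(fit_line(real + synthetic), holdout)
print(err_real, err_mix)
```

The comparison of `err_real` against `err_mix` is the same cross-dataset-generalization question the augmentation papers study, shrunk to a two-parameter model.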
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences of its use.