Related papers: Measures of Complexity for Large Scale Image Datasets

URL: http://arxiv.org/abs/2008.04431v1
Date: Mon, 10 Aug 2020 21:54:23 GMT
Title: Measures of Complexity for Large Scale Image Datasets
Authors: Ameet Annasaheb Rahane and Anbumani Subramanian
Abstract summary: In this work, we build a series of relatively simple methods to measure the complexity of a dataset. We present our analysis using four datasets from the autonomous driving research community - Cityscapes, IDD, BDD and Vistas. Using entropy-based metrics, we present a rank-order complexity of these datasets, which we compare with an established rank-order with respect to deep learning.
Score: 0.3655021726150368
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large-scale image datasets are a growing trend in the field of machine learning. However, it is hard to quantitatively understand or specify how various datasets compare to each other - i.e., if one dataset is more complex or harder to "learn" with respect to a deep-learning-based network. In this work, we build a series of relatively computationally simple methods to measure the complexity of a dataset. Furthermore, we present an approach to demonstrate visualizations of high-dimensional data, in order to assist with visual comparison of datasets. We present our analysis using four datasets from the autonomous driving research community - Cityscapes, IDD, BDD and Vistas. Using entropy-based metrics, we present a rank-order complexity of these datasets, which we compare with an established rank-order with respect to deep learning.
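
The abstract does not say which entropy-based metrics the authors use, so the following is only a minimal sketch of the general idea, assuming Shannon entropy of each image's pixel-intensity distribution, averaged per dataset and then used to rank datasets. The function names (`shannon_entropy`, `mean_dataset_entropy`, `rank_by_entropy`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def shannon_entropy(pixels: Iterable[int]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of pixel values."""
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum_i p_i * log2(1 / p_i), with p_i = count_i / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def mean_dataset_entropy(images: Sequence[Sequence[int]]) -> float:
    """Average per-image entropy; an assumed (hypothetical) dataset-level score."""
    if not images:
        return 0.0
    return sum(shannon_entropy(img) for img in images) / len(images)


def rank_by_entropy(datasets: Dict[str, Sequence[Sequence[int]]]) -> List[str]:
    """Dataset names ordered from highest to lowest mean entropy."""
    return sorted(
        datasets,
        key=lambda name: mean_dataset_entropy(datasets[name]),
        reverse=True,
    )


# Toy stand-ins for real datasets: each "image" is a flat list of grayscale values.
toy_datasets = {
    "flat": [[0] * 16],                 # a single intensity everywhere
    "two_tone": [[0, 255] * 8],         # two equally frequent intensities
    "busy": [[0, 64, 128, 255] * 4],    # four equally frequent intensities
}
ranking = rank_by_entropy(toy_datasets)
```

In this toy setup the ranking simply follows how many distinct, equally frequent intensities each image contains; a real comparison of Cityscapes, IDD, BDD and Vistas would need actual image loading and whatever metrics the paper defines.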

Related papers

What Matters in Learning from Large-Scale Datasets for Robot Manipulation [12.703188997313223]
We conduct a large-scale dataset composition study to answer this question. We develop a data generation framework to procedurally emulate common sources of diversity in existing datasets. We find that camera poses and spatial arrangements are crucial dimensions for both diversity in collection and alignment in retrieval.
arXiv Detail & Related papers (2025-06-16T14:25:29Z)
Anymate: A Dataset and Baselines for Learning 3D Object Rigging [18.973312365787137]
We present a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information. We propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning.
arXiv Detail & Related papers (2025-05-09T17:59:33Z)
Making Sense of Data in the Wild: Data Analysis Automation at Scale [0.1747623282473278]
We propose a novel approach that combines intelligent agents with retrieval augmented generation to automate data analysis, dataset curation and indexing at scale. We demonstrate that our approach results in more detailed dataset descriptions, higher hit rates and greater diversity in dataset retrieval tasks.
arXiv Detail & Related papers (2025-01-27T10:04:10Z)
EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z)
Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining. We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure. This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points. We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes. Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z)
MASSTAR: A Multi-Modal and Large-Scale Scene Dataset with a Versatile Toolchain for Surface Prediction and Completion [25.44529512862336]
MASSTAR is a multi-modal lArge-scale scene dataset with a verSatile Toolchain for surfAce pRediction and completion. We develop a versatile and efficient toolchain for processing the raw 3D data from the environments. We generate an example dataset composed of over a thousand scene-level models with partial real-world data.
arXiv Detail & Related papers (2024-03-18T11:35:18Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps. Our dataset is collected in the form of both 2D images and 3D point clouds. Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
Joint Geometric and Topological Analysis of Hierarchical Datasets [7.098759778181621]
In this paper, we focus on high-dimensional data that are organized into several hierarchical datasets. The main novelty in this work lies in the combination of two powerful data-analytic approaches: topological data analysis and geometric manifold learning. We show that our new method gives rise to superior classification results compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-04-03T13:02:00Z)
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite it being automatically constructed, achieve downstream performance similar to that obtained with existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations. Our framework preserves the relations between samples well. By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
Dataset Condensation with Gradient Matching [36.14340188365505]
We propose a training set synthesis technique for data-efficient learning, called Dataset Condensation, that learns to condense a large dataset into a small set of informative synthetic samples for training deep neural networks from scratch. We rigorously evaluate its performance on several computer vision benchmarks and demonstrate that it significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2020-06-10T16:30:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.