Distributionally Robust Classification on a Data Budget
- URL: http://arxiv.org/abs/2308.03821v1
- Date: Mon, 7 Aug 2023 15:30:02 GMT
- Title: Distributionally Robust Classification on a Data Budget
- Authors: Benjamin Feuer, Ameya Joshi, Minh Pham, Chinmay Hegde
- Abstract summary: We show that standard ResNet-50 trained with the cross-entropy loss on 2.4 million image samples can attain comparable robustness to a CLIP ResNet-50 trained on 400 million samples.
This is the first result showing (near) state-of-the-art distributional robustness on limited data budgets.
- Score: 26.69877485937123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real world uses of deep learning require predictable model behavior under
distribution shifts. Models such as CLIP show emergent natural distributional
robustness comparable to humans, but may require hundreds of millions of
training samples. Can we train robust learners in a domain where data is
limited? To rigorously address this question, we introduce JANuS (Joint
Annotations and Names Set), a collection of four new training datasets with
images, labels, and corresponding captions, and perform a series of carefully
controlled investigations of factors contributing to robustness in image
classification, then compare those results to findings derived from a
large-scale meta-analysis. Using this approach, we show that standard ResNet-50
trained with the cross-entropy loss on 2.4 million image samples can attain
comparable robustness to a CLIP ResNet-50 trained on 400 million samples. To
our knowledge, this is the first result showing (near) state-of-the-art
distributional robustness on limited data budgets. Our dataset is available at
\url{https://huggingface.co/datasets/penfever/JANuS_dataset}, and the code used
to reproduce our experiments can be found at
\url{https://github.com/penfever/vlhub/}.
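To make the budget-constrained baseline concrete, below is a minimal sketch of the kind of recipe the abstract describes: a standard torchvision ResNet-50 trained with cross-entropy, then scored on an in-distribution test set and a distribution-shifted test set. This is illustrative only; it is not the authors' vlhub pipeline, and the dataset objects are hypothetical placeholders (e.g. built from the JANuS images/labels).

```python
# Minimal sketch: standard ResNet-50 + cross-entropy on a labeled image dataset,
# evaluated on an in-distribution split and a shifted split. Illustrative only;
# not the authors' vlhub training code. Dataset objects are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision.models import resnet50

device = "cuda" if torch.cuda.is_available() else "cpu"

def train_one_epoch(model, loader, optimizer):
    model.train()
    criterion = nn.CrossEntropyLoss()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

model = resnet50(num_classes=1000).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# train_set, id_test_set, shifted_test_set are hypothetical dataset objects.
# for epoch in range(90):
#     train_one_epoch(model, DataLoader(train_set, batch_size=256, shuffle=True), optimizer)
# Distributional robustness is then read off by comparing the two accuracies:
# accuracy(model, DataLoader(id_test_set, batch_size=256))
# accuracy(model, DataLoader(shifted_test_set, batch_size=256))
```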
Related papers
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
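The summary above does not detail DQ's actual bin-construction procedure, so the following is only a generic illustration of the "compress a dataset into a small representative subset" idea: cluster feature embeddings with k-means and keep the example closest to each centroid. This is not the DQ algorithm.

```python
# Generic illustration of dataset compression via representative-subset selection.
# NOTE: this is NOT the DQ method from the paper above; it is a simple k-means
# coreset used only to make the "small subset" idea concrete.
import numpy as np
from sklearn.cluster import KMeans

def select_subset(embeddings: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of `budget` samples, one nearest to each k-means centroid."""
    km = KMeans(n_clusters=budget, n_init=10).fit(embeddings)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return np.array(chosen)

# Example with random stand-in embeddings (real usage would embed the images first).
subset_idx = select_subset(np.random.randn(10_000, 512).astype(np.float32), budget=100)
```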
- DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets [4.833815605196965]
This paper presents a novel method for addressing data imbalance in machine learning.
It computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering.
It then uses these likelihoods to weight samples differently during training with a proposed Generalized Focal Loss function.
arXiv Detail & Related papers (2023-08-19T02:11:49Z)
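A minimal sketch of the re-weighting idea summarized above: samples from densely populated appearance clusters get lower weight, rare-looking samples get higher weight, and the per-sample weight scales a focal-style cross-entropy term. The cluster-frequency proxy and the loss form here are assumptions for illustration; the paper's perceptual-embedding likelihoods and its Generalized Focal Loss may differ.

```python
# Sketch of likelihood-based sample re-weighting with a focal-style loss.
# The cluster-frequency weights and the loss form are illustrative assumptions,
# not the paper's exact formulation.
import torch
import torch.nn.functional as F

def cluster_frequency_weights(cluster_ids: torch.Tensor) -> torch.Tensor:
    """Weight each sample inversely to how populated its appearance cluster is."""
    counts = torch.bincount(cluster_ids)
    weights = 1.0 / counts[cluster_ids].float()
    return weights / weights.mean()  # normalize so the average weight is 1

def weighted_focal_loss(logits, targets, sample_weights, gamma=2.0):
    log_p = F.log_softmax(logits, dim=1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of the true class
    focal = (1.0 - log_p_t.exp()) ** gamma * (-log_p_t)         # down-weight easy samples
    return (sample_weights * focal).mean()
```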
- On the Connection between Pre-training Data Diversity and Fine-tuning Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
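As background for the scaling study above, masked image modeling hides a large fraction of image patches and trains the model to reconstruct them. Below is a minimal sketch of the random patch-masking step only; the encoders, decoders, and losses of the specific MIM methods evaluated in that paper are not reproduced.

```python
# Minimal illustration of the masking step in masked image modeling (MIM):
# split an image batch into patches and hide a random subset of them.
import torch

def random_patch_mask(images: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.75):
    b, c, h, w = images.shape
    num_patches = (h // patch_size) * (w // patch_size)
    num_masked = int(mask_ratio * num_patches)
    # Randomly choose which patches to hide for each image in the batch.
    ids_shuffle = torch.rand(b, num_patches).argsort(dim=1)
    mask = torch.zeros(b, num_patches, dtype=torch.bool)
    mask[torch.arange(b).unsqueeze(1), ids_shuffle[:, :num_masked]] = True
    # Patchify to (B, N, C*P*P) and zero out the hidden patches.
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, num_patches, -1)
    visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    return visible, mask  # a MIM model would be trained to reconstruct the masked patches

visible_patches, mask = random_patch_mask(torch.randn(4, 3, 224, 224))
```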
- Incorporating Crowdsourced Annotator Distributions into Ensemble Modeling to Improve Classification Trustworthiness for Ancient Greek Papyri [3.870354915766567]
Two issues which complicate the problem on such datasets are class imbalance and ground-truth uncertainty in labeling.
The application of ensemble modeling to such datasets can help identify images where the ground-truth is questionable and quantify the trustworthiness of those samples.
arXiv Detail & Related papers (2022-10-28T19:39:14Z)
- Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on all three datasets on image classification in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z)
- GDC- Generalized Distribution Calibration for Few-Shot Learning [5.076419064097734]
Few-shot learning is an important problem in machine learning, as large labelled datasets take considerable time and effort to assemble.
Most few-shot learning algorithms require the design of sophisticated models and loss functions, which hampers interpretability.
We propose a Generalized sampling method that learns to estimate few-shot distributions for classification as weighted random variables of all large classes.
arXiv Detail & Related papers (2022-04-11T16:22:53Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Our diffusion-based model trains on images only, leveraging a joint text-image multi-modal metric for retrieval.
Learning to adapt enables several new capabilities.
Fine-tuning a trained model on new samples can be achieved by simply adding them to the retrieval table.
arXiv Detail & Related papers (2022-04-06T14:13:35Z)
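The retrieval table mentioned above is essentially an index of embeddings in a joint text-image space; adapting to new samples means appending their embeddings rather than retraining. A rough sketch of that idea follows (illustrative; the paper's actual index, metric, and diffusion model are not shown, and the embedding encoder is assumed).

```python
# Rough sketch of a retrieval table over joint text-image embeddings.
# "Fine-tuning" on new samples = appending their embeddings; no weight updates.
# Illustrative only; not the paper's retrieval or diffusion code.
import torch
import torch.nn.functional as F

class RetrievalTable:
    def __init__(self, dim: int):
        self.embeddings = torch.empty(0, dim)

    def add(self, new_embeddings: torch.Tensor):
        """Incorporate new samples by appending their embeddings to the table."""
        self.embeddings = torch.cat([self.embeddings, new_embeddings], dim=0)

    def knn(self, query: torch.Tensor, k: int = 5) -> torch.Tensor:
        """Return the k nearest entries (cosine similarity) to condition generation on."""
        sims = F.cosine_similarity(query.unsqueeze(0), self.embeddings, dim=1)
        return self.embeddings[sims.topk(k).indices]

# Hypothetical usage; random tensors stand in for a joint multi-modal encoder's outputs.
table = RetrievalTable(dim=512)
table.add(torch.randn(1000, 512))             # embeddings of the training images
neighbors = table.knn(torch.randn(512), k=5)  # neighbors the diffusion model conditions on
```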
- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [93.38239238988719]
We propose to equip deep neural networks with the ability to learn sample relationships within each mini-batch.
BatchFormer is applied into the batch dimension of each mini-batch to implicitly explore sample relationships during training.
We perform extensive experiments on over ten datasets and the proposed method achieves significant improvements on different data scarcity applications.
arXiv Detail & Related papers (2022-03-03T05:31:33Z)
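A minimal sketch of the batch-dimension idea summarized above: between the backbone and the classifier, a small transformer encoder treats the mini-batch of feature vectors as a sequence, so each sample's representation can attend to the others during training. Layer sizes here are illustrative, and the paper's shared-classifier training detail is omitted.

```python
# Minimal sketch of applying a transformer along the batch dimension so that
# samples in a mini-batch attend to each other. Configuration values are
# illustrative; the paper's shared-classifier training scheme is not shown.
import torch
import torch.nn as nn

class BatchTransformer(nn.Module):
    def __init__(self, feat_dim: int = 512, num_heads: int = 4, num_layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=False)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim). Add a dummy axis of size 1 so the mini-batch
        # itself becomes the sequence dimension: (seq=batch, batch=1, feat_dim).
        return self.encoder(features.unsqueeze(1)).squeeze(1)

# Hypothetical usage between a backbone and a classifier, applied during training.
feats = torch.randn(64, 512)                  # backbone features for a mini-batch of 64
logits = nn.Linear(512, 100)(BatchTransformer()(feats))
```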
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Feature Generation for Long-tail Classification [36.186909933006675]
We show how to generate meaningful features by estimating the tail category's distribution.
We also present a qualitative analysis of generated features using t-SNE visualizations and analyze the nearest neighbors used to calibrate the tail class distributions.
arXiv Detail & Related papers (2021-11-10T21:34:29Z)
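To make the feature-generation idea above concrete: one simple way to "estimate the tail category's distribution" is to fit a Gaussian whose mean comes from the few available tail features and whose covariance is borrowed from similar, well-populated head classes, then sample synthetic features from it. This is an illustrative calibration-style sketch under those assumptions, not the paper's exact generator.

```python
# Illustrative sketch of generating synthetic features for a tail class by
# sampling from an estimated Gaussian: mean from the few tail features,
# covariance borrowed from similar head classes. Not the paper's exact method.
import torch

def generate_tail_features(tail_feats: torch.Tensor,
                           head_covs: list,
                           num_samples: int = 100) -> torch.Tensor:
    """tail_feats: (n_tail, d) with small n_tail; head_covs: covariances of similar head classes."""
    mean = tail_feats.mean(dim=0)
    cov = torch.stack(head_covs).mean(dim=0)      # borrow/average head-class covariance
    cov = cov + 1e-4 * torch.eye(cov.shape[0])    # regularize for numerical stability
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    return dist.sample((num_samples,))            # synthetic tail features

# Hypothetical usage with random stand-ins for real backbone features.
d = 64
tail = torch.randn(5, d)
head_covs = [torch.eye(d), 1.5 * torch.eye(d)]
synthetic = generate_tail_features(tail, head_covs)  # (100, d), used to rebalance the classifier
```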