Vision-Language Dataset Distillation
- URL: http://arxiv.org/abs/2308.07545v3
- Date: Wed, 7 Feb 2024 18:57:27 GMT
- Title: Vision-Language Dataset Distillation
- Authors: Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky
- Abstract summary: We design the first vision-language dataset distillation method, building on the idea of trajectory matching.
A key challenge is that vision-language datasets do not have a set of discrete classes.
Our proposed method jointly distills the image-text pairs in a contrastive formulation.
- Score: 29.371308478925446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset distillation methods reduce large-scale datasets to smaller sets of
synthetic data, which preserve sufficient information for quickly training a
new model from scratch. However, prior work on dataset distillation has focused
exclusively on image classification datasets, whereas modern large-scale
datasets are primarily in the vision-language space. In this work, we design
the first vision-language dataset distillation method, building on the idea of
trajectory matching. A key challenge is that vision-language datasets do not
have a set of discrete classes. To overcome this, our proposed method jointly
distills the image-text pairs in a contrastive formulation. Further, we
leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and
effective trajectory matching in complex modern vision-language models. Since
there are no existing baselines, we compare our distillation approach to three
adapted vision-language coreset selection methods. We demonstrate significant
improvements on the challenging Flickr30K and COCO retrieval benchmarks: for
example, on Flickr30K, the best coreset selection method selecting 1000
image-text pairs for training achieves only 5.6% image-to-text retrieval
accuracy (i.e., recall@1); in contrast, our dataset distillation approach
almost doubles that to 9.9% with just 100 (an order of magnitude fewer)
training pairs.
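Because vision-language data has no discrete classes, the distilled pairs are supervised with a bidirectional contrastive objective in which matched image-text pairs are positives and all other in-batch pairings are negatives. The following is a minimal NumPy sketch of such a CLIP-style symmetric loss; the paper's exact formulation may differ, and all names here are illustrative:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Bidirectional InfoNCE-style loss over a batch of image-text pairs.

    Matched pairs (row i of each matrix) are positives; every other
    pairing in the batch is a negative -- no class labels required.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def cross_entropy(l):
        # softmax cross-entropy with targets on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # symmetric: image-to-text and text-to-image retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A perfectly aligned batch (identical image and text embeddings) drives the loss toward zero, while unrelated embeddings sit near log(N), which is what makes the objective a usable distillation target.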
Related papers
- Low-Rank Similarity Mining for Multimodal Dataset Distillation [50.45577048854653]
We propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation.
LoRS distills a ground truth similarity matrix with image-text pairs, and leverages low-rank factorization for efficiency and scalability.
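The storage argument behind the low-rank factorization can be sketched as follows: instead of learning a dense N x N ground-truth similarity matrix alongside the distilled pairs, keep two thin rank-r factors, dropping the parameter count from O(N^2) to O(N*r). This is an illustrative sketch with made-up dimensions, not the authors' implementation:

```python
import numpy as np

N, r = 1000, 8  # number of distilled pairs, chosen rank (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(N, r)) * 0.01  # learnable low-rank factors
V = rng.normal(size=(N, r)) * 0.01

# Similarity target: identity for matched pairs plus a learned
# low-rank correction capturing soft cross-pair similarity.
S = np.eye(N) + U @ V.T

dense_params = N * N          # full similarity matrix: 1,000,000 entries
lowrank_params = 2 * N * r    # two thin factors: 16,000 entries
print(dense_params, lowrank_params)
```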
arXiv Detail & Related papers (2024-06-06T07:05:20Z) - Dataset Distillation via Adversarial Prediction Matching [24.487950991247764]
We propose an adversarial framework to solve the dataset distillation problem efficiently.
Our method can produce synthetic datasets just 10% the size of the original, yet achieve, on average, 94% of the test accuracy of models trained on the full original datasets.
arXiv Detail & Related papers (2023-12-14T13:19:33Z) - DataDAM: Efficient Dataset Distillation with Attention Matching [15.300968899043498]
Researchers have long tried to minimize training costs in deep learning by maintaining strong generalization across diverse datasets.
Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset.
However, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data.
arXiv Detail & Related papers (2023-09-29T19:07:48Z) - No Data Augmentation? Alternative Regularizations for Effective Training
on Small Datasets [0.0]
We study alternative regularization strategies to push the limits of supervised learning on small image classification datasets.
In particular, we employ a model-agnostic criterion to select (semi-)optimal learning-rate and weight-decay pairs via the norm of the model parameters.
We reach a test accuracy of 66.5%, on par with the best state-of-the-art methods.
arXiv Detail & Related papers (2023-09-04T16:13:59Z) - Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled
Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We present extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired setting.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Filtering, Distillation, and Hard Negatives for Vision-Language
Pre-Training [36.57863211656931]
Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems.
In this paper we improve the following three aspects of the contrastive pre-training pipeline.
First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size.
Next, we propose an approach titled Concept Distillation to leverage strong unimodal representations for contrastive training.
arXiv Detail & Related papers (2023-01-05T19:48:01Z) - Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z) - Data Efficient Language-supervised Zero-shot Recognition with Optimal
Transport Distillation [43.03533959429743]
We propose OTTER, which uses online optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image text pairs.
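OTTER's soft image-text matches can be sketched with entropic-regularized optimal transport computed by Sinkhorn iterations, which replace hard one-to-one (identity) labels with a smooth transport plan. This is a generic textbook sketch, not the authors' code:

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    Returns a transport plan with (approximately) uniform row and
    column marginals; its entries can serve as soft matching targets
    for contrastive learning instead of hard diagonal labels.
    """
    n, m = cost.shape
    K = np.exp(-cost / reg)        # Gibbs kernel from the cost matrix
    a = np.ones(n) / n             # uniform source marginal
    b = np.ones(m) / m             # uniform target marginal
    v = np.ones(m) / m
    for _ in range(n_iters):       # alternate marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan, shape (n, m)
```

Lower `reg` pushes the plan toward a hard permutation-like matching; higher `reg` yields softer, more spread-out targets.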
arXiv Detail & Related papers (2021-12-17T11:27:26Z) - Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and pre-training on a dataset filtered from a larger scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.