DIME-FM: DIstilling Multimodal and Efficient Foundation Models
- URL: http://arxiv.org/abs/2303.18232v2
- Date: Mon, 14 Aug 2023 18:30:40 GMT
- Title: DIME-FM: DIstilling Multimodal and Efficient Foundation Models
- Authors: Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko,
Xide Xia
- Abstract summary: Large Vision-Language Foundation Models (VLFM) are trained on large-scale datasets of image-caption pairs.
We introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models.
The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset.
- Score: 72.1900621000677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and
Florence, are trained on large-scale datasets of image-caption pairs and
achieve superior transferability and robustness on downstream tasks, but they
are difficult to use in many practical applications due to their large size,
high latency and fixed architectures. Unfortunately, recent work shows training
a small custom VLFM for resource-limited applications is currently very
difficult using public and smaller-scale data. In this paper, we introduce a
new distillation mechanism (DIME-FM) that allows us to transfer the knowledge
contained in large VLFMs to smaller, customized foundation models using a
relatively small amount of inexpensive, unpaired images and sentences. We
transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32
model, with only 40M public images and 28.4M unpaired public sentences. The
resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained
on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves
similar results in terms of zero-shot and linear-probing performance on both
ImageNet and the ELEVATER (20 image classification tasks) benchmarks. It also
displays comparable robustness when evaluated on five datasets with natural
distribution shifts from ImageNet.
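As a rough illustration of this kind of unpaired cross-modal distillation, the sketch below matches a student's image-to-text similarity distribution to the teacher's over independently sampled image and sentence batches. This is a minimal PyTorch sketch under assumed interfaces (random tensors stand in for encoder outputs, and the loss is a generic similarity-matching KL term), not DIME-FM's actual objective.

```python
# Hedged sketch: transferring the cross-modal similarity structure of a large
# teacher (e.g., CLIP-ViT-L/14) to a smaller student using unpaired image and
# sentence batches. Random tensors stand in for real encoder outputs; the loss
# below is a generic similarity-matching term, not DIME-FM's published losses.
import torch
import torch.nn.functional as F

def similarity_distillation_loss(t_img, t_txt, s_img, s_txt, tau=0.07):
    """KL divergence between teacher and student image-to-text similarity
    distributions; images and sentences need not be paired, since only the
    teacher's relative similarity structure is transferred."""
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    p_teacher = F.softmax(t_img @ t_txt.T / tau, dim=-1)          # soft targets
    log_p_student = F.log_softmax(s_img @ s_txt.T / tau, dim=-1)  # student predictions
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

if __name__ == "__main__":
    # Teacher embeds into 768-d (ViT-L/14-like), student into 512-d (ViT-B/32-like).
    teacher_img, teacher_txt = torch.randn(8, 768), torch.randn(16, 768)
    student_img, student_txt = torch.randn(8, 512), torch.randn(16, 512)
    print(similarity_distillation_loss(teacher_img, teacher_txt, student_img, student_txt))
```

In a real setup the teacher features would come from the frozen CLIP-ViT-L/14 encoders and the student features from the trainable ViT-B/32 student, computed over the 40M public images and 28.4M public sentences described above; see the paper for the actual loss formulation.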
Related papers
- MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval [7.233106731197739]
We propose a Multi-teacher Cross-modality Alignment Distillation (MCAD) technique to integrate the advantages of single- and dual-stream models.
We implement a lightweight CLIP model on Snapdragon/Dimensity chips with only ~100M running memory and ~8.0ms search latency.
arXiv Detail & Related papers (2023-10-30T15:38:43Z)
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning that produces effective models which outperform bigger and more expensive ones, often trained on orders-of-magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications benefit only marginally, if at all, from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
- WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models [2.603259641572195]
We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million pairs are collected from webpages in which the image and caption are only weakly correlated.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
arXiv Detail & Related papers (2022-03-22T06:12:20Z)
- Efficient deep learning models for land cover image classification [0.29748898344267777]
This work experiments with the BigEarthNet dataset for land use land cover (LULC) image classification.
We benchmark different state-of-the-art models, including Convolutional Neural Networks, Multi-Layer Perceptrons, Visual Transformers, EfficientNets and Wide Residual Networks (WRN).
Our proposed lightweight model has an order of magnitude fewer trainable parameters, achieves a 4.5% higher average F-score across all 19 LULC classes, and trains twice as fast as the state-of-the-art ResNet50 model that we use as a baseline.
arXiv Detail & Related papers (2021-11-18T00:03:14Z)
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural alternative to Convolutional Neural Networks (CNNs).
We show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- MiniVLM: A Smaller and Faster Vision-Language Model [76.35880443015493]
MiniVLM consists of two modules: a vision feature extractor and a vision-language fusion module.
MiniVLM reduces the model size by 73% and the inference time cost by 94%.
arXiv Detail & Related papers (2020-12-13T03:02:06Z)