ZeroVL: A Strong Baseline for Aligning Vision-Language Representations
with Limited Resources
- URL: http://arxiv.org/abs/2112.09331v1
- Date: Fri, 17 Dec 2021 05:40:28 GMT
- Title: ZeroVL: A Strong Baseline for Aligning Vision-Language Representations
with Limited Resources
- Authors: Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie
- Abstract summary: We provide comprehensive training guidance that allows us to conduct dual-encoder multi-modal representation alignment with limited resources.
We collect 100M web data for pre-training and achieve results comparable or superior to those of state-of-the-art methods.
Our code and pre-trained models will be released to facilitate the research community.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have
revealed the potential of aligning multi-modal representations with contrastive
learning. However, these works require a tremendous amount of data and
computational resources (e.g., billion-level web data and hundreds of GPUs),
which prevents researchers with limited resources from reproducing and further
exploring them. To this end, we explore a stack of simple but effective
heuristics and provide comprehensive training guidance, which allows us to
conduct dual-encoder multi-modal representation alignment with limited
resources. We provide a reproducible strong baseline with competitive results,
namely ZeroVL, using only 14M samples from publicly accessible academic
datasets and 8 V100 GPUs. Additionally, we collect 100M web data for
pre-training and achieve results comparable or superior to those of
state-of-the-art methods, further demonstrating the effectiveness of our
method on large-scale data. We hope that this work
will provide useful data points and experience for future research in
multi-modal pre-training. Our code and pre-trained models will be released to
facilitate the research community.
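To make the idea of dual-encoder contrastive alignment concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is not taken from the ZeroVL code: the function names, the temperature value of 0.07, and the toy embeddings are assumptions chosen for illustration, and the paper's own heuristics are not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; image i and text i are assumed to be a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should assign high probability to its diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same, over the columns of the logit matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch (hypothetical values): two orthogonal image embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each text matches its image
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]   # each text matches the other image

loss_aligned = symmetric_contrastive_loss(images, texts_aligned)
loss_swapped = symmetric_contrastive_loss(images, texts_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when matched image and text embeddings point in the same direction and large when they are mismatched, which is the signal that pulls the two encoders into a shared space during pre-training.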
Related papers
- Efficient Multimodal Learning from Data-centric Perspective [21.35857180519653]
We introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning.
Experiments show that our Bunny-4B/8B outperforms the state-of-the-art large MLLMs on multiple benchmarks.
(2024-02-18T10:09:10Z)
- CTP: Towards Vision-Language Continual Pretraining via Compatible
Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D.
Treating the data from each industry as an independent task supports continual learning and conforms to the real-world long-tail nature, simulating pretraining on web data.
(2023-08-14T13:53:18Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present LAMM, a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of 2D and 3D vision tasks.
(2023-06-11T14:01:17Z)
- Lessons learned from the NeurIPS 2021 MetaDL challenge: Backbone
fine-tuning without episodic meta-learning dominates for few-shot learning
image classification [40.901760230639496]
We describe the design of the MetaDL competition series, the datasets, the best experimental results, and the top-ranked methods in the NeurIPS 2021 challenge.
The solutions of the top participants have been open-sourced.
(2022-06-15T10:27:23Z)
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for
Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
(2022-04-22T04:41:04Z)
- Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset
and A Foundation Framework [99.38817546900405]
This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods.
We release a Large-Scale Chinese Cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web.
(2022-02-14T14:37:15Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
(2021-10-21T05:38:45Z)
- Motivating Learners in Multi-Orchestrator Mobile Edge Learning: A
Stackelberg Game Approach [54.28419430315478]
Mobile Edge Learning enables distributed training of Machine Learning models over heterogeneous edge devices.
In MEL, the training performance deteriorates without the availability of sufficient training data or computing resources.
We propose an incentive mechanism in which we formulate the orchestrator-learner interactions as a 2-round Stackelberg game.
(2021-09-25T17:27:48Z)
- Multimodal Prototypical Networks for Few-shot Learning [20.100480009813953]
A cross-modal feature generation framework is used to enrich the sparsely populated embedding space in few-shot scenarios.
We show that in such cases nearest neighbor classification is a viable approach and outperform state-of-the-art single-modal and multimodal few-shot learning methods.
(2020-11-17T19:32:59Z)