ZeroVL: A Strong Baseline for Aligning Vision-Language Representations
with Limited Resources
- URL: http://arxiv.org/abs/2112.09331v1
- Date: Fri, 17 Dec 2021 05:40:28 GMT
- Title: ZeroVL: A Strong Baseline for Aligning Vision-Language Representations
with Limited Resources
- Authors: Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie
- Abstract summary: We provide comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources.
We collect 100M web image-text pairs for pre-training and achieve results comparable or superior to state-of-the-art methods.
Our code and pre-trained models will be released to facilitate the research community.
- Score: 13.30815073857842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing and extending them. To this end, we explore a stack of simple but effective heuristics and provide comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources. We provide a reproducible strong baseline with competitive results, namely ZeroVL, using only 14M image-text pairs from publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web image-text pairs for pre-training and achieve results comparable or superior to state-of-the-art methods, further proving the effectiveness of our method on large-scale data. We hope that this work will provide useful data points and experience for future research in multi-modal pre-training. Our code and pre-trained models will be released to facilitate the research community.
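For context on the objective referred to above, the snippet below is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) loss that CLIP/ALIGN-style dual-encoder alignment is built on. It illustrates the general technique only; the embedding dimension, batch size, and temperature are illustrative assumptions, not ZeroVL's actual configuration.

```python
# Minimal sketch of the symmetric contrastive (InfoNCE) objective used for
# CLIP/ALIGN-style dual-encoder alignment. Dimensions, batch size, and the
# temperature value are illustrative assumptions, not ZeroVL's settings.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] is the scaled similarity between image i and text j.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Random stand-ins for dual-encoder outputs (batch of 8, 256-dim embeddings).
    img = torch.randn(8, 256)
    txt = torch.randn(8, 256)
    print(contrastive_alignment_loss(img, txt).item())
```

Billion-scale dual-encoder works typically pair this loss with very large batches, which is part of what makes the limited-resource setting described above (14M samples, 8 V100 GPUs) non-trivial.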
Related papers
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies.
Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits.
We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Efficient Multimodal Learning from Data-centric Perspective [21.35857180519653]
We introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning.
Experiments show that our Bunny-4B/8B outperforms the state-of-the-art large MLLMs on multiple benchmarks.
arXiv Detail & Related papers (2024-02-18T10:09:10Z)
- CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of VLCP, we first contribute a comprehensive and unified benchmark dataset, P9D.
The data from each industry is treated as an independent task to support continual learning, and its real-world long-tailed distribution simulates pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Lessons learned from the NeurIPS 2021 MetaDL challenge: Backbone fine-tuning without episodic meta-learning dominates for few-shot learning image classification [40.901760230639496]
We describe the design of the MetaDL competition series, the datasets, the best experimental results, and the top-ranked methods in the NeurIPS 2021 challenge.
The solutions of the top participants have been open-sourced.
arXiv Detail & Related papers (2022-06-15T10:27:23Z)
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler, less cost-prohibitive annotations, reaching scales of hundreds of millions to billions of examples.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
- Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework [99.38817546900405]
This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods.
We release this dataset, named Wukong, containing 100 million Chinese image-text pairs collected from the web.
arXiv Detail & Related papers (2022-02-14T14:37:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.