Related papers: ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources

ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources

URL: http://arxiv.org/abs/2112.09331v1
Date: Fri, 17 Dec 2021 05:40:28 GMT
Title: ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources
Authors: Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie
Abstract summary: We provide a comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources. We collect 100M web data for pre-training, and achieve comparable or superior results than state-of-the-art methods. Our code and pre-trained models will be released to facilitate the research community.
Score: 13.30815073857842
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevent researchers with limited resources from reproduction and further exploration. To this end, we explore a stack of simple but effective heuristics, and provide a comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources. We provide a reproducible strong baseline of competitive results, namely ZeroVL, with only 14M publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training, and achieve comparable or superior results than state-of-the-art methods, further proving the effectiveness of our method on large-scale data. We hope that this work will provide useful data points and experience for future research in multi-modal pre-training. Our code and pre-trained models will be released to facilitate the research community.

Related papers

A Theoretical Framework for Data Efficient Multi-Source Transfer Learning Based on Cramér-Rao Bound [16.49737340580437]
We propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure that aligns with cross-entropy loss, and minimize it based on the Cram'er-Rao Bound to determine the optimal transfer quantity for each source task. We develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for training deep multi-source transfer learning models.
arXiv Detail & Related papers (2025-02-06T17:32:49Z)
PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision [7.896850422430362]
Inertial Measurement Units (IMUs) embedded in personal devices have enabled significant applications in health and wellness. While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motions. For video or text modalities, the "pretrain and adapt" approach utilizes large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMU, and
arXiv Detail & Related papers (2024-11-22T18:46:30Z)
Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods. MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
Efficient Multimodal Learning from Data-centric Perspective [21.35857180519653]
We introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning. Experiments show that our Bunny-4B/8B outperforms the state-of-the-art large MLLMs on multiple benchmarks.
arXiv Detail & Related papers (2024-02-18T10:09:10Z)
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets. To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D. The data from each industry as an independent task supports continual learning and conforms to the real-world long-tail nature to simulate pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
Lessons learned from the NeurIPS 2021 MetaDL challenge: Backbone fine-tuning without episodic meta-learning dominates for few-shot learning image classification [40.901760230639496]
We describe the design of the MetaDL competition series, the datasets, the best experimental results, and the top-ranked methods in the NeurIPS 2021 challenge. The solutions of the top participants have been open-sourced.
arXiv Detail & Related papers (2022-06-15T10:27:23Z)
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions. We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework [99.38817546900405]
This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods. We release a Large-Scale Chinese Cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web.
arXiv Detail & Related papers (2022-02-14T14:37:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.