Joint Adaptive Representations for Image-Language Learning
- URL: http://arxiv.org/abs/2305.19924v2
- Date: Thu, 1 Jun 2023 12:41:06 GMT
- Title: Joint Adaptive Representations for Image-Language Learning
- Authors: AJ Piergiovanni and Anelia Angelova
- Abstract summary: We propose a recipe for image-language learning that produces effective models, outperforming bigger and more expensive ones often trained on orders-of-magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms state-of-the-art models that are many times larger, use 2-20x more FLOPs, and are trained on bigger datasets, some with close to 1B training examples.
- Score: 59.40890927221377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-language learning has made unprecedented progress in visual
understanding. These developments have come at high cost, as contemporary
vision-language models require large model sizes and huge amounts of data. We
propose a much simpler recipe for image-language learning that produces
effective models, outperforming bigger and more expensive ones often trained
on orders-of-magnitude larger datasets. Our key finding is the joint learning
of a compact vision and language representation, which adaptively and
iteratively fuses the multi-modal features. This results in more effective
image-language learning, greatly lowering FLOPs by combining and reducing
the number of tokens for both text and images; for example, a 33% reduction
in FLOPs is achieved compared to the baseline fusion techniques used by popular
image-language models, while performance improves. This also allows the model
to scale without a large increase in FLOPs or memory. In addition, we propose
adaptive pre-training data sampling which improves the data efficiency. The
proposed approach achieves competitive performance compared to much larger
models, and does so with significantly less data and FLOPs. With only 40M
training examples and 39 GFLOPs, our lightweight model outperforms
state-of-the-art models that are many times larger, use 2-20x more FLOPs, and
are trained on bigger datasets, some with close to 1B training examples.
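The abstract describes the key mechanism only at a high level: adaptively reducing the number of image and text tokens before iteratively fusing them, which shrinks the sequence the fusion layers must process. The sketch below illustrates that general idea, not the authors' actual architecture; all class names, shapes, and hyperparameters (a learned linear scoring head, top-k selection, a small transformer over the joint sequence) are illustrative assumptions.

```python
# Hedged sketch of adaptive token reduction followed by multi-modal fusion.
# All module names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn

class TokenReducer(nn.Module):
    """Scores tokens with a learned linear head and keeps the top-k,
    shrinking the sequence length seen by the fusion layers."""
    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keep = keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim)
        weights = self.score(tokens).squeeze(-1)          # (batch, seq_len)
        top = weights.topk(self.keep, dim=1)
        idx = top.indices.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        picked = torch.gather(tokens, 1, idx)             # (batch, keep, dim)
        # Weight kept tokens by their sigmoid scores so the scorer is trainable.
        return picked * torch.sigmoid(top.values).unsqueeze(-1)

class IterativeFusion(nn.Module):
    """Reduces image and text tokens, concatenates them, and fuses the
    shortened joint sequence with a few self-attention blocks."""
    def __init__(self, dim: int = 256, keep: int = 32, layers: int = 2):
        super().__init__()
        self.reduce_img = TokenReducer(dim, keep)
        self.reduce_txt = TokenReducer(dim, keep)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(layers)
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.reduce_img(img), self.reduce_txt(txt)], dim=1)
        for blk in self.blocks:
            joint = blk(joint)
        return joint  # (batch, 2 * keep, dim)

fusion = IterativeFusion(dim=256, keep=32)
img = torch.randn(2, 196, 256)   # e.g. ViT patch tokens
txt = torch.randn(2, 64, 256)    # text tokens
out = fusion(img, txt)
print(tuple(out.shape))  # (2, 64, 256)
```

The FLOPs saving comes from attention over 64 joint tokens instead of 260: self-attention cost is quadratic in sequence length, so reducing tokens before fusion cuts the fusion cost far more than proportionally.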
Related papers
- CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning [4.004641316826348]
We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT).
Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets.
The proposed parameter efficient framework can reduce the total trainable model size by 39% and reduce the trainable language model to only 4% compared with the current BERT encoder.
arXiv Detail & Related papers (2024-07-30T17:57:32Z)
- DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z)
- FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning [6.648544684097181]
Multimodal models integrate vision and language into vision-language models (VLMs).
This paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters to train VLMs.
Our approach accelerates training time by up to 34.72 times and requires 2.47 times less memory usage than full fine-tuning.
arXiv Detail & Related papers (2024-04-12T00:36:43Z)
- On the Scalability of Diffusion-based Text-to-Image Generation [97.64837704129005]
We study scaling properties of diffusion-based text-to-image (T2I) models.
For model scaling, we find the location and amount of cross attention distinguishes the performance of existing UNet designs.
On the data scaling side, we show the quality and diversity of the training set matters more than simply dataset size.
arXiv Detail & Related papers (2024-04-03T17:34:28Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- LM-CPPF: Paraphrasing-Guided Data Augmentation for Contrastive Prompt-Based Few-Shot Fine-Tuning [7.543506531838883]
This paper proposes LM-CPPF, Contrastive Paraphrasing-guided Prompt-based Fine-tuning of Language Models.
Our experiments on multiple text classification benchmarks show that this augmentation method outperforms other methods.
arXiv Detail & Related papers (2023-05-29T15:59:51Z)
- Multimodal Data Augmentation for Image Captioning using Diffusion Models [12.221685807426264]
We propose a data augmentation method, leveraging a text-to-image model called Stable Diffusion, to expand the training set.
Experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods.
Further improvements in training efficiency and effectiveness can be obtained by intentionally filtering the generated data.
arXiv Detail & Related papers (2023-05-03T01:57:33Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can benefit a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.