FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
- URL: http://arxiv.org/abs/2411.11927v2
- Date: Fri, 22 Nov 2024 03:30:29 GMT
- Title: FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
- Authors: Anjia Cao, Xing Wei, Zhiheng Ma
- Abstract summary: Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders.
We propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders.
FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy.
- Abstract: Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4% in average image-to-text recall@1 across 36 languages, and by 34.6% in text-to-image recall@1 for long-context retrieval on Urban-1k. Code is available at https://github.com/MIV-XJTU/FLAME.
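The two components are easier to see in a sketch. Below is a minimal, hypothetical illustration of the frozen-LLM text path (not the released implementation; see the repository linked above): the same long caption is encoded under several facet prompts, the frozen LLM's last-token hidden state from each pass serves as that facet's embedding, and only a small projection head is trained for contrastive alignment with image features. The checkpoint name, the prompts, and the 512-d projection are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# "gpt2" stands in for a large decoder-only LM here; FLAME uses a much larger frozen LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModel.from_pretrained("gpt2").eval()
for p in llm.parameters():
    p.requires_grad_(False)  # the text encoder stays frozen

# Hypothetical facet prompts: each asks the model to focus on a different aspect of the caption.
facet_prompts = [
    "Describe the overall scene: ",
    "List the main objects: ",
    "Describe the colors, textures, and style: ",
]

proj = torch.nn.Linear(llm.config.hidden_size, 512)  # the only trainable text-side module in this sketch

def encode_caption(caption: str) -> torch.Tensor:
    """Return one L2-normalized 512-d embedding per facet prompt."""
    facets = []
    for prompt in facet_prompts:
        inputs = tokenizer(prompt + caption, return_tensors="pt")
        with torch.no_grad():
            hidden = llm(**inputs).last_hidden_state      # (1, seq_len, hidden)
        facets.append(proj(hidden[:, -1]))                # last-token state as the facet embedding
    return F.normalize(torch.cat(facets, dim=0), dim=-1)  # (num_facets, 512)
```

Because the LLM never receives gradients, these per-facet caption embeddings can be computed once and cached; the one-pass-per-facet loop above is the naive version, and the facet-decoupled attention plus offline embedding strategy in the paper are what keep this efficient in practice.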
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves performance competitive with cascaded models while using only 70.9% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts.
Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation.
We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs (a sketch of such an adapter follows this entry).
arXiv Detail & Related papers (2024-05-21T16:35:02Z)
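A lightweight adapter of the kind described above might look roughly like this; the dimensions (4096-d LLM states, 768-d conditioning tokens) and the two-layer MLP are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LLMTextAdapter(nn.Module):
    """Map frozen-LLM token states to the conditioning width that a
    text-to-image backbone expects, so only this module needs training."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, llm_dim) hidden states from a frozen LLM.
        return self.net(llm_hidden)  # (batch, seq_len, cond_dim) conditioning tokens
```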
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision-token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that, with 30% of the vision tokens removed across 12 ViT layers, ELIP maintains performance comparable to the unpruned model (a pruning sketch follows this entry).
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
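The pruning criterion can be sketched as below; scoring vision tokens by a dot product with the text embedding and a fixed 70% keep ratio are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def prune_vision_tokens(vision_tokens: torch.Tensor,
                        text_embedding: torch.Tensor,
                        keep_ratio: float = 0.7) -> torch.Tensor:
    """Keep only the vision tokens most relevant to the paired text.

    vision_tokens: (batch, num_tokens, dim); text_embedding: (batch, dim).
    Returns (batch, k, dim) with k ~= num_tokens * keep_ratio.
    """
    scores = torch.einsum("bnd,bd->bn", vision_tokens, text_embedding)  # language-guided relevance
    k = max(1, int(round(vision_tokens.shape[1] * keep_ratio)))
    top_idx = scores.topk(k, dim=1).indices
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1])
    return torch.gather(vision_tokens, 1, gather_idx)
```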
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically, for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M (a sketch of the rewrite sampling follows this entry).
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
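Since the rewrites are produced offline by an LLM, training only swaps captions; a hypothetical sketch of the per-step sampling (the cache layout and uniform sampling are assumptions):

```python
import random

# Rewrites are generated once, offline, by an LLM; CLIP training then samples from the
# cache, so there is no extra computation or memory overhead during training itself.
rewrite_cache: dict[str, list[str]] = {}  # original caption -> its LLM rewrites

def sample_caption(original: str) -> str:
    """Pick the original caption or one of its cached rewrites uniformly at random."""
    return random.choice([original] + rewrite_cache.get(original, []))
```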
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning that produces effective models which outperform bigger and more expensive ones, often trained on orders-of-magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval [71.01982683581572]
The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework that learns importance-aware lexicon representations.
Our framework achieves 5.5x to 221.3x faster retrieval speed and 13.2x to 48.8x less index storage memory (a lexicon-pooling sketch follows this entry).
arXiv Detail & Related papers (2023-02-06T16:24:41Z)
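The lexicon-weighting idea can be sketched as follows: per-token logits over the vocabulary are max-pooled across the sequence and passed through a saturating nonlinearity, yielding one sparse vocabulary-space vector per image or text. The pooling and activation below follow common lexicon-weighting retrievers and are assumptions, not the paper's precise recipe.

```python
import torch

def lexicon_representation(token_logits: torch.Tensor) -> torch.Tensor:
    """Collapse per-token vocabulary logits into one sparse lexicon vector.

    token_logits: (batch, seq_len, vocab_size) from a masked-LM style head.
    Returns (batch, vocab_size) non-negative importance weights; most entries
    are near zero, so the vectors can be stored in an inverted index.
    """
    pooled = token_logits.max(dim=1).values  # strongest evidence per vocabulary entry
    return torch.log1p(torch.relu(pooled))   # keep positive evidence, compress large values
```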
- Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation [43.03533959429743]
We propose OTTER, which uses online optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs (a soft-matching sketch follows this entry).
arXiv Detail & Related papers (2021-12-17T11:27:26Z)
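The soft matching can be illustrated with a few Sinkhorn iterations over the in-batch similarity matrix; the iteration count and entropy scale below are arbitrary illustrative values, and this is a generic Sinkhorn sketch rather than OTTER's exact online procedure.

```python
import torch

def sinkhorn_soft_targets(sim: torch.Tensor, n_iters: int = 5, eps: float = 0.05) -> torch.Tensor:
    """Turn an in-batch similarity matrix into a soft image-text matching plan.

    sim: (B, B) cosine similarities between image and text embeddings.
    The returned (B, B) plan has (approximately) uniform row and column sums and
    can replace the identity matrix as the target of the contrastive loss.
    """
    log_plan = sim / eps
    for _ in range(n_iters):
        log_plan = log_plan - torch.logsumexp(log_plan, dim=1, keepdim=True)  # normalize rows
        log_plan = log_plan - torch.logsumexp(log_plan, dim=0, keepdim=True)  # normalize columns
    return log_plan.exp()
```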
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first method for training text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results on the standard text-to-image generation tasks (a sketch of the pseudo-text-feature trick follows this entry).
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
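The language-free trick rests on CLIP's aligned embedding space: a perturbed image feature stands in for the missing text feature when conditioning the generator. A minimal sketch, with the Gaussian noise scale as an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def pseudo_text_features(clip_image_features: torch.Tensor, noise_scale: float = 0.1) -> torch.Tensor:
    """Generate stand-in text features from CLIP image features, with no captions.

    Because CLIP aligns matching image and text embeddings, a small random
    perturbation of the normalized image feature lands near where the real
    text feature would be, and can condition the generator during training.
    """
    img = F.normalize(clip_image_features, dim=-1)
    noisy = img + noise_scale * torch.randn_like(img)
    return F.normalize(noisy, dim=-1)
```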
- Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation [23.631184498984933]
Natural language has been shown to be a broader and richer source of supervision than supervised "gold" labels.
We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image-text pairs, a dataset 133x smaller than CLIP's.
arXiv Detail & Related papers (2021-04-18T19:55:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.