What If We Recaption Billions of Web Images with LLaMA-3?
- URL: http://arxiv.org/abs/2406.08478v2
- Date: Tue, 18 Jun 2024 11:47:26 GMT
- Title: What If We Recaption Billions of Web Images with LLaMA-3?
- Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
- Abstract summary: We fine-tune a LLaMA-3 powered LLaVA-1.5 and employ it to recaption 1.3 billion images from the DataComp-1B dataset.
Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models.
- Score: 46.20091244944309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
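The recaptioning step described above boils down to batch-running a fine-tuned LLaVA-style captioner over every image. Below is a minimal sketch using Hugging Face transformers; the checkpoint id, prompt, and decoding settings are illustrative assumptions rather than the authors' exact configuration.
```python
# Minimal recaptioning sketch, assuming a LLaVA-1.5-style checkpoint on the
# Hugging Face Hub; the model id, prompt, and decoding settings are placeholders.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in for the LLaMA-3-8B-powered LLaVA-1.5 captioner
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

def recaption(image_url: str) -> str:
    image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
    prompt = "USER: <image>\nPlease describe this image in detail. ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    text = processor.batch_decode(out, skip_special_tokens=True)[0]
    return text.split("ASSISTANT:")[-1].strip()  # keep only the generated caption

print(recaption("https://example.com/sample.jpg"))  # hypothetical image URL
```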
Related papers
- HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models [15.877790469608662]
We introduce an LVLM-driven data refinement pipeline to enhance the quality of image-text pair data.
We propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags.
Our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks.
arXiv Detail & Related papers (2025-07-30T07:21:36Z)
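As a rough illustration of the training paradigm HQ-CLIP describes, the sketch below extends a standard CLIP-style contrastive loss with hard negative descriptions; the exact objective here is an assumption, not the paper's formulation.
```python
# Contrastive loss with extra hard-negative descriptions (InfoNCE-style sketch,
# assumed; not HQ-CLIP's exact loss).
import torch
import torch.nn.functional as F

def contrastive_with_negatives(img_emb, pos_txt_emb, neg_txt_emb, temperature=0.07):
    """img_emb, pos_txt_emb, neg_txt_emb: (B, D) embeddings; the negatives are
    mismatched or corrupted descriptions paired with each image."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(torch.cat([pos_txt_emb, neg_txt_emb], dim=0), dim=-1)  # (2B, D)
    logits = img @ txt.t() / temperature                                      # (B, 2B)
    targets = torch.arange(img.size(0), device=img.device)                    # column i is image i's true caption
    return F.cross_entropy(logits, targets)

# Example with random embeddings:
B, D = 8, 512
loss = contrastive_with_negatives(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```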
- FG-CLIP: Fine-Grained Visual and Textual Alignment [3.830067625507938]
We propose Fine-Grained CLIP, which enhances fine-grained understanding through three key innovations.
We leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details.
We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples.
arXiv Detail & Related papers (2025-05-08T09:06:53Z)
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models.
Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features.
Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
- IDEA: Image Description Enhanced CLIP-Adapter [23.446016867479138]
We propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks.
IDEA captures fine-grained features by leveraging both visual features and textual descriptions of images.
As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets.
arXiv Detail & Related papers (2025-01-15T14:12:59Z)
- FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs [58.95386070800286]
FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations.
We re-annotated the COCO and Visual Genome datasets using our FullAnno system.
Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
arXiv Detail & Related papers (2024-09-20T14:33:17Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval.
CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z)
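The coarse-to-fine retrieval idea in CFIR can be pictured as a two-stage search: a cheap pass over a shared embedding index to shortlist candidates, followed by a more expensive re-ranking. The sketch below is a generic version of that pattern; the shapes and the fine_scorer callable are assumptions, not CFIR's actual components.
```python
# Generic two-stage (coarse-to-fine) text-to-image retrieval sketch.
import numpy as np

def two_stage_retrieve(query_emb, coarse_embs, fine_scorer, image_ids,
                       k_coarse=1000, k_final=10):
    """query_emb: (D,), coarse_embs: (N, D) shared index of image embeddings,
    fine_scorer(query_emb, image_id) -> float is the expensive second stage."""
    # Stage 1: coarse candidate selection by dot-product similarity.
    sims = coarse_embs @ query_emb
    candidates = np.argsort(-sims)[:k_coarse]
    # Stage 2: re-rank only the shortlisted candidates with the fine scorer.
    fine_scores = np.array([fine_scorer(query_emb, image_ids[i]) for i in candidates])
    order = np.argsort(-fine_scores)[:k_final]
    return [image_ids[candidates[i]] for i in order]
```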
- Large Language Models can Share Images, Too! [5.505013339790826]
This paper explores the image-sharing capability of Large Language Models (LLMs), such as GPT-4 and LLaMA 2, in a zero-shot setting.
We introduce the PhotoChat++ dataset, which includes enriched intent, triggering sentence, image description, and salient information.
With extensive experiments, we unlock the image-sharing capability of DribeR equipped with LLMs in zero-shot prompting.
arXiv Detail & Related papers (2023-10-23T10:59:21Z)
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
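A toy sketch of MoMo's shared-encoder idea, in which a single transformer stack processes either modality; the embedding sizes, depth, and patch-feature projection below are assumptions, not MoMo's configuration.
```python
# Toy shared-encoder sketch: one transformer stack serves both modalities
# (all sizes and the patch-feature projection are assumed).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim=512, depth=6, heads=8, vocab_size=30522, patch_dim=768):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, dim)   # token ids -> embeddings
        self.img_embed = nn.Linear(patch_dim, dim)       # pre-extracted patch features -> embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # shared by both modalities

    def forward(self, text_ids=None, image_patches=None):
        if text_ids is not None:
            return self.encoder(self.txt_embed(text_ids))    # text pass
        return self.encoder(self.img_embed(image_patches))   # image pass through the same layers

enc = SharedEncoder()
txt_out = enc(text_ids=torch.randint(0, 30522, (2, 16)))     # (2, 16, 512)
img_out = enc(image_patches=torch.randn(2, 196, 768))        # (2, 196, 512)
```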
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
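The instance-wise knowledge distillation mentioned above can be read as matching the new model's image-text predictions to the old model's on the same batch. The sketch below assumes a temperature-scaled KL objective, which is one common choice and not necessarily the paper's exact loss.
```python
# Instance-wise distillation between old and new vision-language models
# (temperature and KL-based formulation are assumptions).
import torch
import torch.nn.functional as F

def distill_loss(new_logits, old_logits, T=2.0):
    """new_logits, old_logits: (B, B) image-to-text similarity logits from the
    current model and the frozen previous model on the same batch."""
    p_old = F.softmax(old_logits / T, dim=-1)        # soft targets from the old model
    log_p_new = F.log_softmax(new_logits / T, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * (T * T)
```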
- LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
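Nearest-neighbour indices like those released with LAION-5B are typically built and queried with a library such as faiss. The snippet below shows a generic flat inner-product index over normalized embeddings as an assumed illustration, not LAION's actual index configuration.
```python
# Generic faiss nearest-neighbour search over normalized image embeddings;
# index type and sizes are assumptions, not LAION-5B's released setup.
import faiss
import numpy as np

d = 512                                   # embedding dimension
embs = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(embs)                  # inner product on unit vectors = cosine similarity

index = faiss.IndexFlatIP(d)              # exact inner-product index
index.add(embs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)      # top-5 nearest neighbours
print(ids[0], scores[0])
```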
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.