What If We Recaption Billions of Web Images with LLaMA-3?
- URL: http://arxiv.org/abs/2406.08478v2
- Date: Tue, 18 Jun 2024 11:47:26 GMT
- Title: What If We Recaption Billions of Web Images with LLaMA-3?
- Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie,
- Abstract summary: We fine-tune a LLaMA-3 powered LLaVA-1.5 and employ it to recaption 1.3 billion images from the DataComp-1B dataset.
Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models.
- Score: 46.20091244944309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text (IITC)
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask, Image-Text Association (ITA)
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset has 15 times larger scales while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z) - CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval.
CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z) - Compress & Align: Curating Image-Text Data with Human Knowledge [36.34714164235438]
This paper introduces a novel algorithm, rooted in human knowledge, to compress web-crawled image-text datasets to a compact and high-quality form.
A reward model on the annotated dataset internalizes the nuanced human understanding of image-text alignment.
Experiments demonstrate that we are able to secure (or even improve) model performance by compressing the image-text datasets up to 90%.
arXiv Detail & Related papers (2023-12-11T05:57:09Z) - Large Language Models can Share Images, Too! [5.505013339790826]
This paper explores the image-sharing capability of Large Language Models (LLMs), such as GPT-4 and LLaMA 2, in a zero-shot setting.
We introduce the PhotoChat++ dataset, which includes enriched intent, triggering sentence, image description, and salient information.
With extensive experiments, we unlock the image-sharing capability of DribeR equipped with LLMs in zero-shot prompting.
arXiv Detail & Related papers (2023-10-23T10:59:21Z) - VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed as Visual-enriched Captions (VeCap)
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z) - LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image
Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z) - Generative Negative Text Replay for Continual Vision-Language
Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z) - LAION-5B: An open large-scale dataset for training next generation
image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.