Related papers: What If We Recaption Billions of Web Images with LLaMA-3?

URL: http://arxiv.org/abs/2406.08478v2
Date: Tue, 18 Jun 2024 11:47:26 GMT
Title: What If We Recaption Billions of Web Images with LLaMA-3?
Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
Abstract summary: We fine-tune a LLaMA-3 powered LLaVA-1.5 and employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models.
Score: 46.20091244944309
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/

Related papers

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models [15.877790469608662]
We introduce an LVLM-driven data refinement pipeline to enhance the quality of image-text pair data. We propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags. Our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks.
arXiv Detail & Related papers (2025-07-30T07:21:36Z)
FG-CLIP: Fine-Grained Visual and Textual Alignment [3.830067625507938]
We propose Fine-Grained CLIP, which enhances fine-grained understanding through three key innovations. We leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with hard fine-grained negative samples.
arXiv Detail & Related papers (2025-05-08T09:06:53Z)
TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
IDEA: Image Description Enhanced CLIP-Adapter [23.446016867479138]
We propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. IDEA captures fine-grained features by leveraging both visual features and textual descriptions of images. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets.
arXiv Detail & Related papers (2025-01-15T14:12:59Z)
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs [58.95386070800286]
FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations. We re-annotated the COCO and Visual Genome datasets using our FullAnno system. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
arXiv Detail & Related papers (2024-09-20T14:33:17Z)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Our dataset is 15 times larger in scale while maintaining good data quality. We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval. CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z)
Large Language Models can Share Images, Too! [5.505013339790826]
This paper explores the image-sharing capability of Large Language Models (LLMs), such as GPT-4 and LLaMA 2, in a zero-shot setting. We introduce the PhotoChat++ dataset, which includes enriched intent, triggering sentence, image description, and salient information. With extensive experiments, we unlock the image-sharing capability of DribeR equipped with LLMs in zero-shot prompting.
arXiv Detail & Related papers (2023-10-23T10:59:21Z)
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images. We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z)
MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks. We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently. Massive data are usually collected in a streaming fashion. We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset. We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.