BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
- URL: http://arxiv.org/abs/2411.07461v1
- Date: Tue, 12 Nov 2024 00:52:52 GMT
- Title: BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
- Authors: Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu
- Abstract summary: We introduce BLIP3-KALE, a dataset of 218 million image-text pairs.
KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions.
We train vision-language models on KALE and demonstrate improvements on vision-language tasks.
- Score: 118.35194230865451
- License:
- Abstract: We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale
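As a usage note, the released dataset can be inspected directly with the Hugging Face `datasets` library. The sketch below only relies on the repository id given in the abstract; the split name ("train") is an assumption, and column names are not stated in the abstract, so the code prints the keys rather than assuming a schema.

```python
# Minimal sketch: stream a few KALE examples instead of downloading all 218M pairs.
# The repository id comes from the abstract; the split name and field names are
# assumptions, so inspect example.keys() and the dataset card for the actual schema.
from datasets import load_dataset

kale = load_dataset("Salesforce/blip3-kale", split="train", streaming=True)
for example in kale.take(3):
    print(sorted(example.keys()))  # discover the available fields before relying on them
```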
Related papers
- Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To bridge these frozen models, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP.
Experiments show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
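A rough sketch of the learnable-token idea in the Retrieval Enhanced Zero-Shot Video Captioning entry above: a small trainable mapper turns frozen XCLIP video features into soft prefix embeddings that a frozen GPT-2 can consume via `inputs_embeds`. The dimensions, token count, and module design are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Trainable bridge between a frozen video encoder and a frozen language model (sketch)."""
    def __init__(self, video_dim: int = 512, lm_dim: int = 768, num_tokens: int = 10):
        super().__init__()
        self.lm_dim, self.num_tokens = lm_dim, num_tokens
        self.proj = nn.Linear(video_dim, lm_dim * num_tokens)

    def forward(self, video_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (batch, video_dim) pooled features from a frozen XCLIP
        prefix = self.proj(video_feat)
        return prefix.view(-1, self.num_tokens, self.lm_dim)

mapper = PrefixMapper()                 # only this module would be trained
dummy_feat = torch.randn(2, 512)        # stand-in for frozen XCLIP output
prefix_embeds = mapper(dummy_feat)      # (2, 10, 768), fed to GPT-2 as inputs_embeds
```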
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation [34.45033554641476]
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following.
We propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects.
VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; and 3) captioning, where the LLM summarizes the caption proposals and verification results into a final caption.
arXiv Detail & Related papers (2024-04-30T17:55:27Z)
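The VisualFactChecker entry above describes a training-free propose/verify/summarize loop; the sketch below shows its shape only. The helper callables are hypothetical placeholders for captioning models, detection/VQA tools, and an LLM, and are not the paper's released code.

```python
from typing import Callable, List

def visual_fact_check(
    image,
    propose: Callable[[object], List[str]],            # several captioning models
    verify: Callable[[object, str], str],              # LLM driving detection/VQA tools
    summarize: Callable[[List[str], List[str]], str],  # LLM that writes the final caption
) -> str:
    """Sketch of a propose -> verify -> summarize captioning pipeline."""
    proposals = propose(image)                      # 1) multiple initial captions
    checks = [verify(image, c) for c in proposals]  # 2) fact-check each proposal
    return summarize(proposals, checks)             # 3) fuse into one grounded caption
```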
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent work on masked signal reconstruction, we propose pixel-level masking as a novel form of data augmentation.
Our method outperforms state-of-the-art approaches on various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
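The pixel-level masking augmentation in the Augment the Pairs entry above can be approximated as below; the patch size, masking ratio, and zero-fill choice are illustrative assumptions.

```python
import torch

def random_patch_mask(img: torch.Tensor, patch: int = 16, ratio: float = 0.3) -> torch.Tensor:
    """Zero out a random subset of patches in a (C, H, W) image tensor."""
    _, h, w = img.shape
    out = img.clone()
    ph, pw = h // patch, w // patch
    keep = torch.rand(ph, pw) >= ratio                        # False marks a masked patch
    mask = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    out[:, :ph * patch, :pw * patch] *= mask                  # broadcast over channels
    return out

augmented = random_patch_mask(torch.rand(3, 224, 224))        # the paired caption is left unchanged
```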
- Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens [87.52235889917223]
We set the output of the proposed Im2Sp model to discretized speech units, i.e., the quantized speech features of a self-supervised speech model.
With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases.
arXiv Detail & Related papers (2023-09-15T16:48:34Z)
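For the discretized speech units in the Im2Sp entry above, a common recipe is to k-means-cluster frame-level features from a self-supervised speech encoder (e.g., HuBERT) and use the cluster ids as target tokens. The sketch uses random features as a stand-in, and the cluster count is an assumption, not the paper's setting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for (num_frames, feat_dim) features from a frozen self-supervised
# speech encoder; in practice these would come from the real model.
frames = np.random.randn(1000, 768).astype(np.float32)

kmeans = KMeans(n_clusters=200, n_init=10, random_state=0).fit(frames)
speech_units = kmeans.predict(frames)  # one integer unit id per frame
print(speech_units[:20])               # the kind of token sequence an Im2Sp-style decoder targets
```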
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
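The Improving Multimodal Datasets entry above studies replacing nondescript web alt-text with generated captions; one way to sketch that is with an off-the-shelf captioner. BLIP is a stand-in choice here, and the "too short to be useful" heuristic is an assumption, not the paper's filtering rule.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def maybe_recaption(image: Image.Image, alt_text: str, min_words: int = 3) -> str:
    """Keep informative alt-text; otherwise substitute a synthetic caption."""
    if len(alt_text.split()) >= min_words:  # crude proxy for "descriptive enough"
        return alt_text
    inputs = processor(images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(ids[0], skip_special_tokens=True)
```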
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
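A compact sketch of the two ingredients in the ALBEF entry above: a symmetric image-text contrastive loss applied before fusion, and an EMA momentum copy of the model that supplies pseudo-targets for distillation. The temperature, momentum rate, and the omission of the full distillation objective are simplifications, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss on L2-normalized embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temp   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))  # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

@torch.no_grad()
def ema_update(model: torch.nn.Module, momentum_model: torch.nn.Module, m: float = 0.995) -> None:
    """Update the momentum model that produces pseudo-targets for distillation."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

loss = itc_loss(torch.randn(8, 256), torch.randn(8, 256))
```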
This list is automatically generated from the titles and abstracts of the papers on this site.