UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- URL: http://arxiv.org/abs/2405.10311v2
- Date: Sun, 20 Oct 2024 05:49:18 GMT
- Title: UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- Authors: Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, Jimmy Lin
- Abstract summary: We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference.
Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers like UniIR models.
- Score: 76.30799731147589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Multi-Modal (MM) Large Language Models (LLMs) have unlocked many complex use-cases that require MM understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of MM-LLMs we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers like UniIR models.
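The abstract describes the UniRAG pipeline only at a high level: at inference time a multimodal retriever such as UniIR fetches relevant image-caption pairs, and those pairs are inserted into the prompt as few-shot examples before the MM-LLM generates its output. The Python sketch below illustrates that plug-and-play flow for image captioning; the `MMRetriever` and `MMLLM` interfaces and the prompt wording are illustrative assumptions, not the paper's actual templates or APIs.

```python
# Minimal sketch of retrieval-augmented few-shot prompting, assuming
# hypothetical MMRetriever / MMLLM interfaces; the real UniRAG prompt
# templates and UniIR / MM-LLM APIs may differ.

from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Example:
    image_ref: str   # path or URL of a retrieved candidate image
    caption: str     # its paired caption


class MMRetriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[Example]: ...


class MMLLM(Protocol):
    def generate(self, prompt: str, image_ref: Optional[str] = None) -> str: ...


def rag_caption(image_ref: str, retriever: MMRetriever, model: MMLLM, k: int = 3) -> str:
    """Caption an image, using k retrieved image-caption pairs as few-shot examples."""
    shots = retriever.retrieve(query=image_ref, k=k)          # MM retrieval step
    shot_block = "\n".join(
        f"Example {i + 1}: image {ex.image_ref} -> caption: {ex.caption}"
        for i, ex in enumerate(shots)
    )
    prompt = (
        "Use the following similar image-caption pairs as guidance.\n"
        f"{shot_block}\n"
        "Now write a caption for the given image."
    )
    return model.generate(prompt, image_ref=image_ref)        # augmented inference
```

The same pattern applies in the MM generation direction (e.g., text-guided image generation), with retrieved caption-image pairs serving as the few-shot examples instead.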
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Synthetic Multimodal Question Generation [60.33494376081317]
Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents.
We propose SMMQG, a synthetic data generation framework that generates question and answer pairs directly from multimodal documents.
We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it.
arXiv Detail & Related papers (2024-07-02T12:57:42Z) - MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning [44.497776004372724]
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks.
We present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow.
To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors.
arXiv Detail & Related papers (2024-06-25T17:55:11Z) - A Review of Multi-Modal Large Language and Vision Models [1.9685736810241874]
Large Language Models (LLMs) have emerged as a focal point of research and application.
Recently, LLMs have been extended into multi-modal large language models (MM-LLMs).
This paper provides an extensive review of the current state of LLMs with multi-modal capabilities, as well as of the very recent MM-LLMs.
arXiv Detail & Related papers (2024-03-28T15:53:45Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - MoAI: Mixture of All Intelligence for Large Language and Vision Models [42.182009352159]
Mixture of All Intelligence (MoAI) is an instruction-tuned large language and vision model (LLVM).
MoAI uses auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models.
MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot vision language (VL) tasks.
arXiv Detail & Related papers (2024-03-12T10:44:13Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs [48.269363759989915]
The research focuses on two aspects: first, image-to-image matching, and second, multi-image-to-text matching.
We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL.
arXiv Detail & Related papers (2024-01-05T00:26:07Z) - VIGC: Visual Instruction Generation and Correction [47.477290387002284]
The scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge.
Current leading approaches, such as LLaVA, rely on language-only GPT-4 to generate data.
This paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables multimodal large language models to generate instruction-tuning data.
arXiv Detail & Related papers (2023-08-24T11:21:05Z) - MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z) - Augmenting Interpretable Models with LLMs during Training [73.40079895413861]
We propose Augmented Interpretable Models (Aug-imodels) to build efficient and interpretable models.
Aug-imodels use LLMs during fitting but not during inference, allowing complete transparency.
We explore two instantiations of Aug-imodels in natural-language processing: (i) Aug-GAM, which augments a generalized additive model with decoupled embeddings from an LLM, and (ii) Aug-Tree, which augments a decision tree with LLM feature expansions (a rough sketch of the Aug-GAM idea follows this entry).
arXiv Detail & Related papers (2022-09-23T18:36:01Z)
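As a rough illustration of the Aug-GAM idea summarized in the entry above (an LLM used during fitting but not at inference), the sketch below embeds n-gram features once with a frozen LLM encoder, fits a linear model on summed embeddings, and then collapses the learned weights into a per-n-gram score table so inference needs no LLM. The `embed_ngram` callable and function names are placeholders, not the paper's implementation.

```python
# Hedged, simplified sketch of an Aug-GAM-style interpretable model:
# the LLM is used only inside fit_aug_gam; inference is a table lookup.

from typing import Callable, Dict, List
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_aug_gam(
    docs: List[List[str]],          # each document as a list of n-gram strings
    labels: List[int],
    embed_ngram: Callable[[str], np.ndarray],   # any frozen LLM encoder (assumption)
) -> Dict[str, float]:
    """Fit a linear model on summed n-gram embeddings, then distill it
    into a per-n-gram score table (the interpretable model)."""
    vocab = sorted({ng for doc in docs for ng in doc})
    emb = {ng: embed_ngram(ng) for ng in vocab}                      # LLM used here only
    X = np.stack([np.sum([emb[ng] for ng in doc], axis=0) for doc in docs])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    w, b = clf.coef_[0], float(clf.intercept_[0])
    # Because the model is linear in a sum of embeddings, a document's score
    # decomposes into per-n-gram contributions w . emb(ngram).
    table = {ng: float(w @ emb[ng]) for ng in vocab}
    table["__bias__"] = b
    return table


def predict_aug_gam(doc: List[str], table: Dict[str, float]) -> float:
    """LLM-free inference: sum the precomputed n-gram scores (a logit)."""
    return table["__bias__"] + sum(table.get(ng, 0.0) for ng in doc)
```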