BRIT: Bidirectional Retrieval over Unified Image-Text Graph
- URL: http://arxiv.org/abs/2505.18450v1
- Date: Sat, 24 May 2025 01:20:51 GMT
- Title: BRIT: Bidirectional Retrieval over Unified Image-Text Graph
- Authors: Ainulla Khan, Yamada Moyuru, Srinidhi Akella
- Abstract summary: Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. This paper proposes BRIT, a novel multi-modal RAG framework that unifies the various text-image connections in a document into a multi-modal graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also additional relevant content.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored, especially in settings where fine-tuning is not an option. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies the various text-image connections in a document into a multi-modal graph and retrieves texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also additional relevant content for answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce the MM-RAG test set, specifically designed for multi-modal question answering tasks that require understanding text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions over multi-modal documents.
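To make the mechanism concrete, here is a minimal Python sketch of a unified image-text graph with bidirectional traversal, in the spirit of the abstract. The node contents, the toy seed-relevance function, and the hop limit are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the idea behind BRIT, not the authors' code. Node names,
# the toy relevance function, and the hop limit are illustrative assumptions.
from collections import deque

# Unified graph: text chunks and images are nodes; edges encode
# text-image connections found in the document (captions, references, layout).
nodes = {
    "t1": {"modality": "text", "content": "Figure 2 shows the pump assembly."},
    "i1": {"modality": "image", "content": "<figure-2 embedding>"},
    "t2": {"modality": "text", "content": "The pump assembly bolts onto the frame."},
}
edges = {"t1": ["i1"], "i1": ["t1", "t2"], "t2": ["i1"]}

def seed_relevance(query, node):
    """Toy lexical match standing in for a dense retriever score."""
    return node["modality"] == "text" and any(
        w in node["content"].lower() for w in query.lower().split())

def retrieve_subgraph(query, max_hops=2):
    """Expand from query-relevant seeds along BOTH text->image and
    image->text edges, collecting a query-specific sub-graph."""
    seeds = [n for n, d in nodes.items() if seed_relevance(query, d)]
    seen, frontier = set(seeds), deque((s, 0) for s in seeds)
    while frontier:
        nid, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nbr in edges.get(nid, []):       # traversal is bidirectional:
            if nbr not in seen:              # text->image and image->text
                seen.add(nbr)
                frontier.append((nbr, hops + 1))
    return [(n, nodes[n]["modality"]) for n in seen]

print(retrieve_subgraph("where does the pump assembly go?"))
```

The multi-hop payoff is visible even in this toy: a query that matches only text nodes still pulls in the connected image, and from the image the other text passage describing it.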
Related papers
- Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation [12.631059980161435]
We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components. Our method introduces a hybrid retrieval strategy, combining a sub-dimensional sparse retriever with a dense retriever. Experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality.
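A hedged sketch of a hybrid retriever in the spirit described above: a sparse lexical score fused with a dense-vector score. The weighting scheme and toy vectors are assumptions, not the paper's formulation.

```python
# Hybrid retrieval sketch: sparse (lexical) + dense (vector) score fusion.
import math

def sparse_score(query, doc_text):
    """Sub-dimensional sparse match approximated by token overlap."""
    q, d = set(query.lower().split()), set(doc_text.lower().split())
    return len(q & d) / max(len(q), 1)

def dense_score(q_vec, d_vec):
    """Cosine similarity between query and image/document embeddings."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    return dot / norm if norm else 0.0

def hybrid_score(query, doc_text, q_vec, d_vec, alpha=0.5):
    # alpha trades off lexical precision against embedding recall (assumed knob).
    return alpha * sparse_score(query, doc_text) + (1 - alpha) * dense_score(q_vec, d_vec)

print(hybrid_score("red bird on branch", "a red bird perched on a branch",
                   [0.1, 0.9], [0.2, 0.8]))
```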
arXiv Detail & Related papers (2025-05-28T04:09:49Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
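A minimal sketch of Matryoshka-style visual token compression: the same token sequence is pooled to nested granularities so a retriever can trade cost for fidelity. The pool sizes and the mean-pooling choice are assumptions, not the MME architecture.

```python
# Pool one visual token sequence down to several nested granularities.
def mean_pool(tokens, target_len):
    """Average-pool a list of d-dim token vectors down to target_len tokens."""
    chunk = max(len(tokens) // target_len, 1)
    pooled = []
    for i in range(0, len(tokens), chunk):
        group = tokens[i:i + chunk]
        dim = len(group[0])
        pooled.append([sum(v[k] for v in group) / len(group) for k in range(dim)])
    return pooled[:target_len]

visual_tokens = [[float(i), float(i) * 0.5] for i in range(256)]  # toy 256x2 tokens
granularities = {g: mean_pool(visual_tokens, g) for g in (64, 16, 4)}
print({g: len(t) for g, t in granularities.items()})  # {64: 64, 16: 16, 4: 4}
```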
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing [3.271701183630488]
We present a self-supervised cross-modal text-image time series retrieval (text-ITSR) method in remote sensing (RS). We focus our attention on text-ITSR for pairs of images (i.e., bitemporal images). The proposed text-ITSR method consists of two key components: 1) modality-specific encoders to model the semantic content of bitemporal images and text sentences with discriminative features; and 2) modality-specific projection heads to align textual and image representations in a shared embedding space.
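A hedged sketch of that two-component design: separate encoders per modality, then projection heads into one shared space where similarity is computed. The toy features and random projection weights are illustrative assumptions.

```python
# Modality-specific features -> linear projection heads -> shared space.
import math, random

random.seed(0)
DIM_IMG, DIM_TXT, DIM_SHARED = 8, 6, 4

def project(vec, weights):
    """A linear projection head mapping a modality vector to the shared space."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

W_img = [[random.uniform(-1, 1) for _ in range(DIM_IMG)] for _ in range(DIM_SHARED)]
W_txt = [[random.uniform(-1, 1) for _ in range(DIM_TXT)] for _ in range(DIM_SHARED)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

bitemporal_image_feat = [0.3] * DIM_IMG   # stand-in for the image-pair encoder output
text_feat = [0.7] * DIM_TXT               # stand-in for the sentence encoder output
score = cosine(project(bitemporal_image_feat, W_img), project(text_feat, W_txt))
print(f"shared-space similarity: {score:.3f}")
```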
arXiv Detail & Related papers (2025-01-31T11:14:38Z)
- MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval [26.585985828583304]
We introduce MIRe, a retrieval framework that achieves modality interaction without fusing textual features during the alignment. Our method allows the textual query to attend to visual embeddings while not feeding text-driven signals back into the visual representations. Our experiments demonstrate that our pre-training strategy significantly enhances the understanding of multimodal queries.
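A minimal sketch of that one-directional, fusion-free interaction: text query tokens attend over visual embeddings, but nothing is written back into the visual side. Dimensions and toy vectors are assumptions.

```python
# One-way dot-product attention: text reads from vision; vision stays frozen.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attend(query_tok, visual_embs):
    """One text token attends over visual embeddings (dot-product attention)."""
    scores = softmax([sum(q * v for q, v in zip(query_tok, emb)) for emb in visual_embs])
    dim = len(visual_embs[0])
    return [sum(w * emb[k] for w, emb in zip(scores, visual_embs)) for k in range(dim)]

visual_embs = [[1.0, 0.0], [0.0, 1.0]]   # never updated by text-driven signals
query_tokens = [[0.9, 0.1], [0.2, 0.8]]
attended = [attend(t, visual_embs) for t in query_tokens]
print(attended)  # text side is enriched; visual side is left untouched
```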
arXiv Detail & Related papers (2024-11-13T04:32:58Z)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
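As an illustration of adaptive visual-length allocation, here is a rough sketch that splits a fixed visual token budget across images in proportion to their pixel area. The budget, resolutions, and proportional rule are assumptions; Leopard's actual encoding module presumably differs.

```python
# Split a visual token budget across multiple high-resolution images.
def allocate_visual_tokens(resolutions, budget=2048, min_tokens=64):
    """Give each image a share of `budget` tokens proportional to its area,
    with a floor so small images keep a usable representation."""
    areas = [w * h for w, h in resolutions]
    total = sum(areas)
    alloc = [max(min_tokens, int(budget * a / total)) for a in areas]
    # Trim overshoot (from the floor) off the largest allocation if needed.
    overshoot = sum(alloc) - budget
    if overshoot > 0:
        alloc[alloc.index(max(alloc))] -= overshoot
    return alloc

print(allocate_visual_tokens([(1024, 768), (640, 480), (2048, 1536)]))
```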
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both.
Previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach.
We propose BI-MDRG, which bridges the response generation path so that image history information is utilized to improve the relevance of text responses to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
- TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models [96.72318842152148]
We propose a unified framework for text-to-image generation and retrieval with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method for text-to-image retrieval in a training-free manner. We then propose an autonomous decision mechanism to choose the best-matched one between generated and retrieved images as the response to the text prompt.
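A minimal sketch of that decision step: score a generated image and a retrieved image against the text prompt and return the better match. The word-overlap scorer is a stand-in assumption; the paper's mechanism relies on the LMM itself.

```python
# Choose between a generated and a retrieved image for a text prompt.
def match_score(prompt, image_desc):
    """Toy proxy for prompt-image matching (word overlap)."""
    p, d = set(prompt.lower().split()), set(image_desc.lower().split())
    return len(p & d) / max(len(p), 1)

def respond(prompt, generated_desc, retrieved_desc):
    g, r = match_score(prompt, generated_desc), match_score(prompt, retrieved_desc)
    return ("generated", generated_desc) if g >= r else ("retrieved", retrieved_desc)

print(respond("a corgi surfing a wave",
              "a corgi standing on a surfboard riding a wave",
              "a corgi sitting on a beach"))
```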
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
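A toy sketch of retrieving with a fused text-image query, in the spirit of the entry above; the additive fusion and the embeddings are stand-in assumptions, not ReViz itself.

```python
# Fuse text and image query vectors, then rank a corpus by similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(text_vec, image_vec, corpus, k=2):
    query = [t + i for t, i in zip(text_vec, image_vec)]  # simple additive fusion
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

corpus = {"doc_a": [0.9, 0.1, 0.0], "doc_b": [0.1, 0.8, 0.1], "doc_c": [0.0, 0.2, 0.9]}
print(retrieve([0.5, 0.1, 0.0], [0.4, 0.0, 0.1], corpus))
```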
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
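A minimal sketch of text-aware QA generation: turn OCR tokens found in an image into question-answer pairs. TAG's generator is learned; the templates and OCR input here are illustrative assumptions only.

```python
# Templated QA pairs from scene text recovered by OCR.
def generate_qa_pairs(ocr_tokens):
    """Each scene-text token becomes the answer of a templated question."""
    templates = [
        "What word is written on the {region}?",
        "What does the text on the {region} say?",
    ]
    pairs = []
    for token, region in ocr_tokens:
        for tpl in templates:
            pairs.append({"question": tpl.format(region=region), "answer": token})
    return pairs

ocr = [("STOP", "sign"), ("Main St", "street plate")]
for qa in generate_qa_pairs(ocr):
    print(qa)
```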
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K, by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
- ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of the two elements of the query (a reference image and a text modifier) into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
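Reading the title literally, here is a hypothetical scorer combining an explicit text-to-image matching term with an implicit image-to-image similarity term. The toy scorers and equal weighting are assumptions, not the paper's model.

```python
# Rank candidate images by explicit (text) match plus implicit (image) similarity.
def explicit_match(modifier_text, target_desc):
    t, d = set(modifier_text.lower().split()), set(target_desc.lower().split())
    return len(t & d) / max(len(t), 1)          # text-to-image view

def implicit_similarity(ref_vec, tgt_vec):
    return sum(a * b for a, b in zip(ref_vec, tgt_vec))  # image-to-image view

def artemis_like_score(modifier_text, ref_vec, target):
    desc, vec = target
    return explicit_match(modifier_text, desc) + implicit_similarity(ref_vec, vec)

targets = [("red dress with long sleeves", [0.9, 0.2]),
           ("blue dress with short sleeves", [0.3, 0.7])]
ref = [0.8, 0.3]  # embedding of the reference image in the query
print(max(targets, key=lambda t: artemis_like_score("long sleeves", ref, t))[0])
```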
arXiv Detail & Related papers (2022-03-15T17:29:20Z)
- RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER [4.510210055307459]
Multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets.
We introduce a method of text-image relation propagation into the multimodal BERT model.
We propose a multitask algorithm to train on the MNER datasets.
arXiv Detail & Related papers (2021-02-05T02:45:30Z)