Related papers: Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

URL: http://arxiv.org/abs/2412.14880v1
Date: Thu, 19 Dec 2024 14:17:09 GMT
Title: Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering
Authors: Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang
Abstract summary: "Retrieve-then-answer" pipelines often suffer from cascading errors because the training objective of QA fails to optimize the retrieval stage. We propose a novel method to effectively introduce and reference retrieved information into the QA. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP.
Score: 14.63910474388089
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing them to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the training objective of QA fails to optimize the retrieval stage. To address this issue, we propose a novel method to effectively introduce and reference retrieved information into the QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming the task into text-to-text retrieval and helps improve retrieval. To better integrate retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy that calculates both sentence-level and word-level similarity scores to further enhance retrieval and filter out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
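The coarse-to-fine scoring mentioned in the abstract could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say how the sentence-level and word-level scores are computed or combined, so the cosine-similarity filter, the max-over-words aggregation, the `top_k` cutoff, the function names, and the toy embeddings are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def coarse_to_fine_rank(query_sent, query_words, cand_sents, cand_words, top_k=2):
    """Hypothetical two-stage ranking of text summaries against a question.

    Coarse stage: cosine similarity between one sentence embedding per item,
    keeping the top_k candidates.
    Fine stage: for each surviving candidate, match every query word to its
    most similar candidate word and average those maxima.
    """
    q = l2_normalize(np.asarray(query_sent, dtype=float))
    c = l2_normalize(np.asarray(cand_sents, dtype=float))
    coarse = c @ q                        # shape: (num_candidates,)
    keep = np.argsort(-coarse)[:top_k]    # indices of the best coarse matches

    qw = l2_normalize(np.asarray(query_words, dtype=float))
    fine = {}
    for i in keep:
        cw = l2_normalize(np.asarray(cand_words[i], dtype=float))
        sim = qw @ cw.T                   # shape: (query_words, candidate_words)
        fine[int(i)] = float(sim.max(axis=1).mean())

    ranking = sorted(fine, key=lambda i: -fine[i])
    return ranking, coarse, fine


# Toy 3-d embeddings (made up for illustration; a real system would use a text encoder).
query_sent = [1.0, 0.0, 0.0]
query_words = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
cand_sents = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 1.0, 0.0]]
cand_words = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # covers only the first query word
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # covers both query words
    [[0.0, 1.0, 0.0]],                   # unrelated; dropped by the coarse stage
]

ranking, coarse, fine = coarse_to_fine_rank(
    query_sent, query_words, cand_sents, cand_words, top_k=2
)
print(ranking, fine)
```

In this toy setup the coarse stage discards the unrelated third candidate, and the word-level stage then reorders the two survivors, which is the kind of filtering of irrelevant detail the abstract alludes to.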

Related papers

Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models [58.46663983451155]
PixSearch is an end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization.
arXiv Detail & Related papers (2026-01-27T00:46:08Z)
Instance-Level Composed Image Retrieval [34.04479584450632]
i-CIR is a new evaluation dataset that focuses on an instance-level class definition. Its design and curation process keep the dataset compact to facilitate future research. We leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC.
arXiv Detail & Related papers (2025-10-29T10:57:59Z)
Generalized Contrastive Learning for Universal Multimodal Retrieval [53.70202081784898]
Cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of fused image-text modalities. This paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the need for new dataset curation.
arXiv Detail & Related papers (2025-09-30T01:25:04Z)
Chain-of-Thought Re-ranking for Image Retrieval Tasks [16.13448876168839]
We propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address image retrieval. By allowing an MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making. Our method achieves state-of-the-art performance across three image retrieval tasks: text-to-image retrieval (TIR), composed image retrieval (CIR), and chat-based image retrieval (Chat-IR).
arXiv Detail & Related papers (2025-09-18T08:48:46Z)
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning [62.61187785810336]
ImageScope is a training-free, three-stage framework that unifies language-guided image retrieval tasks. In the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity. In the second and third stages, we reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally.
arXiv Detail & Related papers (2025-03-13T08:43:24Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering. We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z)
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval. CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z)
Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search [89.1772985740272]
In mixed-initiative conversational search systems, clarifying questions are used to help users who struggle to express their intentions in a single query. We hypothesize that in scenarios where multimodal information is pertinent, the clarification process can be improved by using non-textual information. We collect a dataset named Melon that contains over 4k multimodal clarifying questions, enriched with over 14k images. Several analyses are conducted to understand the importance of multimodal contents during the query clarification phase.
arXiv Detail & Related papers (2024-02-12T16:04:01Z)
End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model, "ReViz", that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion. We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR). We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries. Our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval. We first examine the related concerns and why the focus is on image-text retrieval tasks. We analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z)
Cross-Modal Retrieval Augmentation for Multi-Modal Classification [61.5253261560224]
We explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering. First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement on image-caption retrieval. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines.
arXiv Detail & Related papers (2021-04-16T13:27:45Z)
Using Image Captions and Multitask Learning for Recommending Query Reformulations [11.99358906295761]
We aim to enhance the query recommendation experience for a commercial image search engine. Our proposed methodology incorporates current state-of-the-art practices from relevant literature.
arXiv Detail & Related papers (2020-03-02T08:22:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.