Enhancing Multi-Image Question Answering via Submodular Subset Selection
- URL: http://arxiv.org/abs/2505.10533v1
- Date: Thu, 15 May 2025 17:41:52 GMT
- Title: Enhancing Multi-Image Question Answering via Submodular Subset Selection
- Authors: Aaryan Sharma, Shivansh Gupta, Samar Agarwal, Vishak Prasad C., Ganesh Ramakrishnan
- Abstract summary: Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving a single image but struggle when presented with a collection of multiple images. We propose an enhancement to the retriever framework introduced in the MIRAGE model using submodular subset selection techniques.
- Score: 16.66633426354087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving a single image, but they struggle when presented with a collection of multiple images (the Multiple Image Question Answering scenario). These tasks, which involve reasoning over a large number of images, raise issues of scalability (as the number of images grows) and retrieval performance. In this work, we propose an enhancement to the retriever framework introduced in the MIRAGE model using submodular subset selection techniques. Our method leverages query-aware submodular functions, such as GraphCut, to pre-select a subset of semantically relevant images before the main retrieval component. We demonstrate that using anchor-based queries and augmenting the data improve the effectiveness of the submodular-retriever pipeline, particularly at large haystack sizes.
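To make the pre-selection step concrete, below is a minimal sketch of greedy maximization of a query-aware GraphCut objective, f(S) = Σ_{i∈S} sim(q, i) − λ Σ_{i<j∈S} sim(i, j), which trades off relevance to the query against redundancy within the selected subset. The function name, the λ trade-off parameter, and the assumption of L2-normalized embeddings are illustrative choices, not details taken from the paper.

```python
import numpy as np

def graphcut_preselect(query_emb, image_embs, k, lam=0.5):
    """Greedily pick k images maximizing a query-aware GraphCut objective:
    relevance to the query minus lam times within-subset similarity."""
    # Cosine similarities; embeddings are assumed L2-normalized.
    rel = image_embs @ query_emb        # (n,) relevance of each image to the query
    sim = image_embs @ image_embs.T     # (n, n) pairwise image similarity

    selected = []
    remaining = set(range(len(image_embs)))
    for _ in range(k):
        # Marginal gain of adding i: rel[i] - lam * sum_{j in selected} sim(i, j)
        gains = {i: rel[i] - lam * sim[i, selected].sum() for i in remaining}
        best = max(gains, key=gains.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because GraphCut is submodular, this lazy-free greedy loop carries the usual approximation guarantees, and the pre-selected subset can then be handed to the main retriever.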
Related papers
- Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation [12.631059980161435]
We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components. Our method introduces a hybrid retrieval strategy, combining a sub-dimensional sparse retriever with a dense retriever. Experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality.
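As a rough illustration of the generic hybrid idea (the paper's sub-dimensional decomposition is more involved), here is a minimal sketch of fusing sparse and dense retrieval scores; the min-max normalization and the alpha weight are assumptions, not the paper's method.

```python
import numpy as np

def hybrid_scores(sparse_scores, dense_scores, alpha=0.5):
    """Fuse sparse and dense retrieval scores with a convex combination,
    after min-max normalizing each score list to [0, 1]."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return alpha * norm(sparse_scores) + (1 - alpha) * norm(dense_scores)

# Rank candidates by the fused score (toy numbers).
fused = hybrid_scores([12.1, 3.4, 7.9], [0.82, 0.40, 0.77], alpha=0.6)
ranking = np.argsort(-fused)
```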
arXiv Detail & Related papers (2025-05-28T04:09:49Z)
- QuARI: Query Adaptive Retrieval Improvement [10.896025071832055]
We show that a linear transformation of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
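A minimal sketch of the core idea, re-scoring precomputed embeddings through a linear map: in QuARI the transformation is predicted per query, whereas here W is simply a given matrix, and all names are illustrative.

```python
import numpy as np

def rescore_with_linear_map(query_emb, gallery_embs, W):
    """Re-rank a gallery by pushing VLM embeddings through a linear map W,
    then scoring by cosine similarity in the transformed space."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q = l2norm(query_emb @ W)
    G = l2norm(gallery_embs @ W)   # one matmul, cheap even for millions of rows
    return np.argsort(-(G @ q))    # gallery indices sorted by adapted similarity
```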
arXiv Detail & Related papers (2025-05-27T18:21:48Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, with a dedicated pipeline designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question answering for single images. The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering. We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- Mixed-Query Transformer: A Unified Image Segmentation Architecture [57.32212654642384]
Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task.
We introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights.
arXiv Detail & Related papers (2024-04-06T01:54:17Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- LMR: A Large-Scale Multi-Reference Dataset for Reference-based Super-Resolution [86.81241084950524]
It is widely agreed that reference-based super-resolution (RefSR) achieves superior results by referring to similar high-quality images, compared to single image super-resolution (SISR).
Previous RefSR methods have all focused on single-reference image training, while multiple reference images are often available in testing or practical applications.
We construct a large-scale multi-reference super-resolution dataset, named LMR. It contains 112,142 groups of 300x300 training images, 10x the size of the largest existing RefSR dataset.
arXiv Detail & Related papers (2023-03-09T01:07:06Z)
- Self-supervised Multi-view Disentanglement for Expansion of Visual Collections [6.944742823561]
We consider the setting where a query for similar images is derived from a collection of images.
For visual search, the similarity measurements may be made along multiple axes, or views, such as style and color.
Our objective is to design a retrieval algorithm that effectively combines similarities computed over representations from multiple views.
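A minimal sketch of one way such per-view similarities could be combined, here a fixed weighted sum over views (e.g. style and color); the dict-based interface and the weights are assumptions, not the paper's learned combination.

```python
import numpy as np

def multi_view_retrieval(query_views, gallery_views, weights):
    """Rank gallery items by a weighted sum of per-view cosine similarities.
    query_views / gallery_views map a view name (e.g. "style", "color")
    to a query vector / an (n, d) matrix of gallery embeddings."""
    n = next(iter(gallery_views.values())).shape[0]
    scores = np.zeros(n)
    for view, w in weights.items():
        q = query_views[view] / np.linalg.norm(query_views[view])
        G = gallery_views[view]
        G = G / np.linalg.norm(G, axis=1, keepdims=True)
        scores += w * (G @ q)   # accumulate this view's contribution
    return np.argsort(-scores)
```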
arXiv Detail & Related papers (2023-02-04T22:09:17Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval-based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Probabilistic Compositional Embeddings for Multimodal Image Retrieval [48.450232527041436]
We investigate a more challenging scenario for composing multiple multimodal queries in image retrieval.
Given an arbitrary number of query images and (or) texts, our goal is to retrieve target images containing the semantic concepts specified in multiple multimodal queries.
We propose a novel multimodal probabilistic composer (MPC) to learn an informative embedding that can flexibly encode the semantics of various queries.
arXiv Detail & Related papers (2022-04-12T14:45:37Z)
- Multi-Image Summarization: Textual Summary from a Set of Cohesive Images [17.688344968462275]
This paper proposes the new task of multi-image summarization.
It aims to generate a concise and descriptive textual summary given a coherent set of input images.
A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes.
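As a rough illustration of average feature aggregation (the paper uses a learned aggregation network; plain mean pooling here is a stand-in):

```python
import numpy as np

def aggregate_collection(per_image_feats):
    """Collapse a set of per-image feature vectors into one collection-level
    vector by averaging; a text decoder would be conditioned on the result."""
    feats = np.stack(per_image_feats, axis=0)   # (num_images, d)
    return feats.mean(axis=0)                   # (d,)
```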
arXiv Detail & Related papers (2020-06-15T18:45:35Z)
- Using Image Captions and Multitask Learning for Recommending Query Reformulations [11.99358906295761]
We aim to enhance the query recommendation experience for a commercial image search engine.
Our proposed methodology incorporates current state-of-the-art practices from relevant literature.
arXiv Detail & Related papers (2020-03-02T08:22:46Z)
- CBIR using features derived by Deep Learning [0.0]
In a Content Based Image Retrieval (CBIR) System, the task is to retrieve similar images from a large database given a query image.
We propose to use features derived from a pre-trained deep convolutional network trained on a large image classification problem.
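A minimal sketch of this recipe with an off-the-shelf torchvision backbone; the choice of ResNet-50 and the cosine-similarity ranking are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-50 pretrained on ImageNet as a fixed feature extractor:
# drop the classification head, keep the pooled 2048-d features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    f = backbone(x).squeeze(0).numpy()
    return f / np.linalg.norm(f)   # L2-normalize for cosine similarity

def retrieve(query_path, db_feats, top_k=5):
    """Rank a database of pre-computed, L2-normalized (n, 2048) features
    by cosine similarity to the query image."""
    q = embed(query_path)
    return np.argsort(-(db_feats @ q))[:top_k]
```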
arXiv Detail & Related papers (2020-02-13T21:26:32Z)