MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion
- URL: http://arxiv.org/abs/2503.20698v3
- Date: Wed, 23 Apr 2025 14:10:39 GMT
- Title: MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion
- Authors: Saron Samuel, Dan DeGenaro, Jimena Guallar-Blasco, Kate Sanders, Oluwaseun Eisape, Arun Reddy, Alexander Martin, Andrew Yates, Eugene Yang, Cameron Carpenter, David Etter, Efsun Kayi, Matthew Wiesner, Kenton Murray, Reno Kriz,
- Abstract summary: We create a search system that extracts text and features from both visual and audio modalities.<n> MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
- Score: 44.45109614673675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create a search system MMMORRF that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
Related papers
- Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.30364248231053]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M2RAG)<n>M2RAG is a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs)<n>To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection [1.5236380958983644]
Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores.<n>Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth map.<n>We propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues.
arXiv Detail & Related papers (2025-01-18T08:09:44Z) - Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation [97.82707398481273]
We develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF)<n>Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner.<n>We perform extensive experiments on three benchmark datasets, demonstrating the significant improvements over several state-of-the-art multimodal recommendation models.
arXiv Detail & Related papers (2025-01-13T07:51:43Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Multi-Modal Retrieval For Large Language Model Based Speech Recognition [15.494654232953678]
We propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques.
We show that speech-based multi-modal retrieval outperforms text based retrieval.
We achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
arXiv Detail & Related papers (2024-06-13T22:55:22Z) - Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF)
MoPE-BAF is a novel multi-modal soft prompt framework based on the unified vision-language model (VLM)
arXiv Detail & Related papers (2024-03-17T19:12:26Z) - Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and
Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe)
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z) - MuMUR : Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose a framework MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
arXiv Detail & Related papers (2022-08-24T13:55:15Z) - M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and each modality content in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.