MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed
- URL: http://arxiv.org/abs/2506.09409v1
- Date: Wed, 11 Jun 2025 05:40:26 GMT
- Title: MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed
- Authors: Jiaqi Samantha Zhan, Crystina Zhang, Shengyao Zhuang, Xueguang Ma, Jimmy Lin
- Abstract summary: We use OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, to generate unified embeddings for text, images, audio, and video. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025.
- Score: 55.526939500742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective video retrieval remains challenging due to the complexity of integrating visual, auditory, and textual modalities. In this paper, we explore unified retrieval methods using OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, in the context of the MAGMaR shared task. Evaluated on the comprehensive MultiVENT 2.0 dataset, OmniEmbed generates unified embeddings for text, images, audio, and video, enabling robust multimodal retrieval. By finetuning OmniEmbed on the combined multimodal data (visual frames, audio tracks, and textual descriptions) provided in MultiVENT 2.0, we achieve substantial improvements on complex, multilingual video retrieval tasks. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025, highlighting the practical effectiveness of our unified multimodal retrieval approach. The model checkpoint from this work is open-sourced.
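A minimal sketch of this unified-embedding retrieval setup, under the assumption of a single multimodal encoder: queries and candidate videos are mapped into one shared vector space and ranked by cosine similarity. The `encode` function below is a hypothetical placeholder, not OmniEmbed's actual interface (which lives in the Tevatron 2.0 toolkit).

```python
# Sketch of unified-embedding retrieval (hypothetical encoder API).
import numpy as np

def encode(item) -> np.ndarray:
    """Placeholder: map a text query or a video (frames + audio + description)
    into a single d-dimensional embedding with the same multimodal model."""
    raise NotImplementedError

def retrieve(query: str, videos: list, top_k: int = 10):
    # Embed the query and every candidate video in one shared space.
    q = encode(query)
    V = np.stack([encode(v) for v in videos])          # (N, d)

    # Cosine similarity = dot product of L2-normalized vectors.
    q = q / np.linalg.norm(q)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    scores = V @ q                                     # (N,)

    # Return indices and scores of the top-k most similar videos.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```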
Related papers
- CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval [70.9990850395981]
We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes four modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR is trained to enhance dynamic modality selection via two key innovations.
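"Late interaction" here is typically ColBERT-style token-level scoring; the following is a generic MaxSim sketch under that assumption, not CLaMR's actual implementation.

```python
# Generic late-interaction (MaxSim) scoring sketch: each query token is matched
# against its best-scoring indexed token, and the per-token maxima are summed.
import numpy as np

def maxsim_score(query_tok: np.ndarray, doc_tok: np.ndarray) -> float:
    """query_tok: (Lq, d) query token embeddings.
    doc_tok:   (Ld, d) token embeddings pooled from the indexed modalities
               (e.g., video frames, transcribed speech, on-screen text, metadata)."""
    sim = query_tok @ doc_tok.T          # (Lq, Ld) pairwise similarities
    return float(sim.max(axis=1).sum())  # best match per query token, then sum
```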
arXiv Detail & Related papers (2025-06-06T15:02:30Z) - OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts [46.77966058862399]
We introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. We propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
arXiv Detail & Related papers (2025-03-29T02:46:58Z) - MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion [43.725594356981254]
We create a search system that extracts text and features from both visual and audio modalities. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
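Reciprocal rank fusion combines per-modality ranked lists by scoring each document as the sum of 1/(k + rank) over the lists. The sketch below uses the common default k = 60 as an assumption; it illustrates the general technique, not necessarily MMMORRF's exact configuration.

```python
# Generic reciprocal rank fusion (RRF) sketch: merge ranked lists produced by
# separate retrievers (e.g., one over visual features, one over audio/text).
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: fuse a visual-modality ranking with a speech/OCR-text ranking.
fused = rrf_fuse([["vid3", "vid1", "vid7"], ["vid1", "vid9", "vid3"]])
```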
arXiv Detail & Related papers (2025-03-26T16:28:04Z) - DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering [13.466266412068475]
We introduce the DocVideoQA task and dataset for the first time, comprising 1,454 videos across 23 categories with a total duration of about 828 hours. The dataset is annotated with 154k question-answer pairs generated manually and via GPT, assessing models' comprehension, temporal awareness, and modality integration capabilities. Our method enhances unimodal feature extraction with diverse instruction-tuning data and employs contrastive learning to strengthen modality integration.
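As an illustration of the contrastive objective mentioned above, the following is a generic symmetric InfoNCE loss between paired video and text embeddings; the temperature value and pairing scheme are assumptions, not DocVideoQA's exact training recipe.

```python
# Generic symmetric InfoNCE contrastive loss between paired modality embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (B, d) embeddings where row i of each is a matched pair."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Pull matched pairs together, push mismatched pairs apart, in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```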
arXiv Detail & Related papers (2025-03-20T06:21:25Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M²RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual event-centric video retrieval benchmark. It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and the content of each modality in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z) - See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations at well-known academic conferences such as NDSS, ICML, and NeurIPS.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z) - VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of the video with the semantic meaning of the article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)