Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video
- URL: http://arxiv.org/abs/2510.03458v1
- Date: Fri, 03 Oct 2025 19:29:50 GMT
- Title: Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video
- Authors: Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer
- Abstract summary: We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
- Score: 5.732421858297378
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text → video) and joint-modal (e.g., text → video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
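To make the retrieval setup concrete, the sketch below shows how a single shared embedding space supports both cross-modal (text → video) and joint-modal (text → video+audio) scoring with one dot-product ranker. It is a minimal illustration, not the paper's implementation: the `encode` stub, the 1024-dimensional width, and the candidate names are assumptions introduced here; a real run would call the released Omni-Embed-Nemotron model instead.

```python
# Hedged sketch: one shared embedding space, one scorer for all modalities.
# "encode" is a placeholder for the (assumed) unified encoder interface.
import torch
import torch.nn.functional as F

EMBED_DIM = 1024  # assumed embedding width, not stated in the abstract

def encode(modality: str, payload: str) -> torch.Tensor:
    """Placeholder encoder: maps any input (text, image, audio, video,
    or a video+audio pair) to one L2-normalized vector in the shared space."""
    torch.manual_seed(abs(hash((modality, payload))) % (2**31))
    return F.normalize(torch.randn(EMBED_DIM), dim=0)

# A text query ranks heterogeneous candidates; cross-modal (text -> video)
# and joint-modal (text -> video+audio) retrieval use the same similarity.
query = encode("text", "how to replace a bicycle chain")
candidates = [
    ("passage.txt",          encode("text",        "Steps to replace a chain ...")),
    ("slide_page3.png",      encode("image",       "slide_page3.png")),
    ("repair_clip.mp4",      encode("video",       "repair_clip.mp4")),
    ("repair_clip.mp4+aud",  encode("video+audio", "repair_clip.mp4|audio")),
]

# Cosine similarity reduces to a dot product on normalized embeddings.
scores = torch.stack([emb for _, emb in candidates]) @ query
for (name, _), score in sorted(zip(candidates, scores.tolist()),
                               key=lambda pair: -pair[1]):
    print(f"{score:+.3f}  {name}")
```

Under this reading, the single scorer is what lets one index serve text, image, audio, and video documents side by side, which is the property the abstract attributes to the unified model.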
Related papers
- V-Agent: An Interactive Video Search System Using Vision-Language Models [5.245473886566199]
V-Agent is a novel multi-agent platform designed for advanced video search and interactive user-system conversations. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark.
arXiv Detail & Related papers (2025-11-04T07:24:45Z)
- MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed [55.526939500742]
We use OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, to generate unified embeddings for text, images, audio, and video. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025.
arXiv Detail & Related papers (2025-06-11T05:40:26Z)
- Large Language Models Implicitly Learn to See and Hear Just By Reading [61.3564313676731]
By training an auto-regressive LLM on text tokens, the text model inherently develops an internal ability to understand images and audio. We show this working for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches.
arXiv Detail & Related papers (2025-05-20T22:20:16Z)
- Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry. We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness. We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
arXiv Detail & Related papers (2025-05-05T08:52:49Z)
- Movie2Story: A framework for understanding videos and telling stories in the form of novel text [0.0]
We propose a novel benchmark to evaluate text generation capabilities in scenarios enriched with auxiliary information. Our work introduces an innovative automatic dataset generation method to ensure the availability of accurate auxiliary information. Our experiments reveal that current Multi-modal Large Language Models (MLLMs) perform suboptimally under the proposed evaluation metrics.
arXiv Detail & Related papers (2024-12-19T15:44:04Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- Emu: Generative Pretraining in Multimodality [43.759593451544546]
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and texts in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset [17.927825332032477]
We explore an automatically generated large-scale omni-modality video caption dataset called VAST-27M.
We first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions.
We employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions.
arXiv Detail & Related papers (2023-05-29T14:34:50Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels. Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text. There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)