RzenEmbed: Towards Comprehensive Multimodal Retrieval
- URL: http://arxiv.org/abs/2510.27350v1
- Date: Fri, 31 Oct 2025 10:34:51 GMT
- Title: RzenEmbed: Towards Comprehensive Multimodal Retrieval
- Authors: Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, Yuhui Yin,
- Abstract summary: RzenEmbed is a unified framework to learn embeddings across a diverse set of modalities.<n>We employ a novel two-stage training strategy to learn discriminative representations.<n>RzenEmbed sets a new state-of-the-art on the MMEB benchmark.
- Score: 11.540508319550087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available in https://huggingface.co/qihoo360/RzenEmbed.
Related papers
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models.<n>We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions.<n>Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
arXiv Detail & Related papers (2025-09-23T16:22:27Z) - Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model [28.559525134847828]
We present UniPic2-SD3.5M-Kontext, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework.<n>Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data.<n>UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly larger generation parameters.
arXiv Detail & Related papers (2025-09-04T17:00:17Z) - QZhou-Embedding Technical Report [16.213081669689185]
Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies.<n>Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance.
arXiv Detail & Related papers (2025-08-29T13:47:22Z) - VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.43882565434444]
We propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms.<n>First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types.<n>Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
arXiv Detail & Related papers (2025-07-07T00:51:57Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.