Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
- URL: http://arxiv.org/abs/2601.04720v1
- Date: Thu, 08 Jan 2026 08:36:06 GMT
- Title: Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
- Authors: Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
- Abstract summary: We introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series. The Qwen3-VL-Embedding model employs a multi-stage training paradigm to generate semantically rich high-dimensional vectors. Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs.
- Score: 80.53668824533493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2026). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
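As an illustration of the retrieve-then-rerank pipeline the abstract describes, the sketch below shows how Matryoshka-style embeddings are typically consumed: truncate to a smaller dimension, renormalize, retrieve a shortlist by cosine similarity, then re-score the shortlist. This is a minimal sketch, not the official API: the model calls are replaced by random placeholder vectors, the full dimension of 4096 is an assumption, and in the real pipeline Qwen3-VL-Reranker would score each query-candidate pair with a cross-encoder.

```python
import numpy as np

def mrl_truncate(emb: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style shortening: keep the first `dim` coordinates,
    then L2-renormalize so cosine similarity stays well-defined."""
    short = emb[..., :dim]
    return short / np.linalg.norm(short, axis=-1, keepdims=True)

# Placeholder vectors standing in for Qwen3-VL-Embedding outputs
# (one query against a small multimodal corpus; 4096 is assumed).
rng = np.random.default_rng(0)
full_dim = 4096
query = rng.normal(size=(1, full_dim))
corpus = rng.normal(size=(100, full_dim))

# Stage 1: cheap dense retrieval at a reduced Matryoshka dimension.
q256, c256 = mrl_truncate(query, 256), mrl_truncate(corpus, 256)
scores = (q256 @ c256.T).ravel()     # cosine similarity on unit vectors
top_k = np.argsort(-scores)[:10]

# Stage 2: a cross-encoder reranker (Qwen3-VL-Reranker in the paper)
# would now score each (query, candidate) pair jointly; here we simply
# re-score the shortlist at full dimension as a stand-in.
qf, cf = mrl_truncate(query, full_dim), mrl_truncate(corpus[top_k], full_dim)
reranked = top_k[np.argsort(-(qf @ cf.T).ravel())]
print("final ranking:", reranked[:5])
```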
Related papers
- Qwen3-VL Technical Report [153.3964813640593]
Qwen3-VL is the most capable vision-language model to date, achieving superior performance across a broad range of multimodal benchmarks. It supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks.
arXiv Detail & Related papers (2025-11-26T17:59:08Z)
- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models [90.54780244175511]
We introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series. The Qwen3 Embedding series offers a spectrum of model sizes for both embedding and reranking tasks, and it achieves state-of-the-art results across diverse benchmarks.
arXiv Detail & Related papers (2025-06-05T15:49:48Z)
- Qwen3 Technical Report [137.96804244102205]
We present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities.
arXiv Detail & Related papers (2025-05-14T13:41:34Z)
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering [8.21219588747224]
This paper introduces the Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2025-04-11T05:51:44Z)
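The "direct token-level injection" described in the VLMT entry above can be sketched in a few lines of PyTorch. All names and dimensions here are assumptions made for illustration; the pattern shown, projecting vision-encoder features into the language model's embedding space and splicing them into the token sequence, is the general technique, not VLMT's exact implementation.

```python
import torch
import torch.nn as nn

class TokenInjector(nn.Module):
    """Illustrative fusion: project vision-encoder patch features into the
    language model's embedding space and prepend them to the text tokens."""
    def __init__(self, vis_dim: int = 1024, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)  # assumed linear projector

    def forward(self, patch_feats: torch.Tensor, text_embeds: torch.Tensor):
        # patch_feats: (batch, n_patches, vis_dim) from a vision encoder
        # text_embeds: (batch, n_tokens, lm_dim) from the LM embedding table
        vis_tokens = self.proj(patch_feats)            # (batch, n_patches, lm_dim)
        return torch.cat([vis_tokens, text_embeds], dim=1)  # one shared sequence

fused = TokenInjector()(torch.randn(2, 16, 1024), torch.randn(2, 8, 2048))
print(fused.shape)  # torch.Size([2, 24, 2048]) -> fed to the seq2seq LM
```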
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach to visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
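A minimal sketch of the M-RoPE idea from the Qwen2-VL entry above: the rotary position embedding is decomposed so that separate chunks of each head's dimension are rotated by temporal, height, and width position indices. The chunk split (32/16/16) and the frequency base are assumptions for illustration, not Qwen2-VL's actual configuration.

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0):
    """Standard RoPE on the last dim of x (dim must be even),
    using integer positions `pos` of shape (seq,)."""
    dim = x.shape[-1]
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    ang = pos[:, None].float() * freqs[None, :]   # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def mrope(x, t_pos, h_pos, w_pos, split=(32, 16, 16)):
    """M-RoPE-style sketch: rotate separate head-dim chunks with
    temporal / height / width position indices (split sizes assumed)."""
    rotated, offset = [], 0
    for size, pos in zip(split, (t_pos, h_pos, w_pos)):
        rotated.append(rope_rotate(x[..., offset:offset + size], pos))
        offset += size
    return torch.cat(rotated, dim=-1)

# A 2x2 image patch grid at a single timestep: one position id per axis.
x = torch.randn(4, 64)
t = torch.zeros(4, dtype=torch.long)
h = torch.tensor([0, 0, 1, 1]); w = torch.tensor([0, 1, 0, 1])
print(mrope(x, t, h, w).shape)  # torch.Size([4, 64])
```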
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [72.41822115096741]
We introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs).
We endow it with visual capacity through (i) a meticulously designed visual receptor, (ii) an input-output interface, (iii) a 3-stage training pipeline, and (iv) a multilingual multimodal cleaned corpus.
The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales.
arXiv Detail & Related papers (2023-08-24T17:59:17Z)