jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
- URL: http://arxiv.org/abs/2506.18902v3
- Date: Mon, 07 Jul 2025 17:41:02 GMT
- Title: jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
- Authors: Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao
- Abstract summary: We introduce jina-embeddings-v4, a multimodal embedding model that unifies text and image representations. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
- Score: 5.587329786636647
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-document retrieval, semantic text similarity, and code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
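The abstract names two output modes (a pooled single-vector embedding and token-level multi-vector embeddings scored in the late-interaction style) as well as task-specific LoRA adapters. The two sketches below illustrate these ideas in isolation; the dimensions, pooling choice, backbone, and adapter configuration are illustrative assumptions, not the model's actual implementation.

```python
# Single-vector vs. multi-vector (late-interaction / MaxSim) scoring,
# using random arrays as stand-ins for real embeddings.
import numpy as np


def cosine_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Single-vector mode: one embedding per query/document, cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vec / np.linalg.norm(doc_vec)
    return float(q @ d)


def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Multi-vector mode: one embedding per query token and document token/patch.

    Each query vector is matched to its best-scoring document vector and the
    per-token maxima are summed (ColBERT-style MaxSim).
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best document match per query token


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 128                                 # assumed embedding width
    query_tokens = rng.normal(size=(8, dim))  # e.g. 8 query token embeddings
    doc_tokens = rng.normal(size=(200, dim))  # e.g. 200 document/patch embeddings

    # Single-vector mode: mean-pool the token embeddings (a simplifying assumption).
    print("cosine:", cosine_score(query_tokens.mean(axis=0), doc_tokens.mean(axis=0)))
    # Multi-vector mode: score the token sets directly.
    print("maxsim:", late_interaction_score(query_tokens, doc_tokens))
```

For the task-specific LoRA adapters, the general mechanism can be sketched with the `peft` library: a frozen backbone carries several small low-rank adapters, and the one matching the current retrieval scenario is activated. The backbone, rank, target modules, and adapter names below are placeholders, not the configuration used for jina-embeddings-v4.

```python
# Task-specific LoRA adapters on a frozen transformer backbone (illustrative).
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

backbone = AutoModel.from_pretrained("bert-base-uncased")  # stand-in backbone


def task_lora() -> LoraConfig:
    """One small LoRA configuration per retrieval task (values are assumptions)."""
    return LoraConfig(
        r=8,                                # low-rank update dimension
        lora_alpha=16,                      # LoRA scaling factor
        target_modules=["query", "value"],  # attention projections in this backbone
        lora_dropout=0.05,
    )


model = get_peft_model(backbone, task_lora())   # first adapter, named "default"
model.add_adapter("code_search", task_lora())   # hypothetical per-task adapters
model.add_adapter("text_matching", task_lora())

model.set_adapter("code_search")     # activate the adapter for the current task
model.print_trainable_parameters()   # only the LoRA weights remain trainable
```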
Related papers
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.43882565434444]
We propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
arXiv Detail & Related papers (2025-07-07T00:51:57Z) - MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling [58.251621637466904]
Multi-query Scene Text Retrieval with Attention Recycling (MSTAR) is a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset.
arXiv Detail & Related papers (2025-06-12T11:54:13Z) - A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering. Our framework demonstrates robust performance without the need for task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z) - QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z) - Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval [44.008094698200026]
Cross-modal retrieval is gaining increasing efficacy and interest from the research community. In this paper, we design an approach that allows for multimodal queries composed of both an image and a text. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones.
arXiv Detail & Related papers (2025-03-03T19:01:17Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs). The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities; a toy sketch of this multi-granularity idea appears after this list.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - jina-embeddings-v3: Multilingual Embeddings With Task LoRA [6.926642162309072]
jina-embeddings-v3 is a novel text embedding model with 570 million parameters.
It achieves state-of-the-art performance on multilingual data and long-context retrieval tasks.
It supports context lengths of up to 8192 tokens.
arXiv Detail & Related papers (2024-09-16T11:10:29Z) - Localizing Events in Videos with Multimodal Queries [61.20556229245365]
Localizing events in videos based on semantic queries is a pivotal task in video understanding.
We introduce ICQ, a new benchmark designed for localizing events in videos with multimodal queries.
We propose 3 Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning on pseudo-MQs strategy.
arXiv Detail & Related papers (2024-06-14T14:35:58Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition [80.21796574234287]
M$3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates textitmulti-view encoding, textitmulti-view matching, and textitmulti-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z)
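As referenced in the text-image interleaved retrieval entry above, the Matryoshka Multimodal Embedder compresses visual tokens to several granularities. The sketch below is only a generic illustration of producing nested, progressively coarser token sets by average pooling; it is not the MME mechanism from that paper, and all shapes are assumptions.

```python
# A toy illustration of multi-granularity visual-token compression: average-pool
# a sequence of patch embeddings into progressively smaller token sets, so a
# retriever can trade accuracy for index size.
import numpy as np


def pool_tokens(tokens: np.ndarray, num_out: int) -> np.ndarray:
    """Average-pool (num_tokens, dim) embeddings down to (num_out, dim)."""
    chunks = np.array_split(tokens, num_out, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch_embeddings = rng.normal(size=(256, 128))  # e.g. 256 patches, dim 128

    # Nested granularities: full resolution, then coarser and coarser summaries.
    for num_tokens in (256, 64, 16, 4):
        compressed = pool_tokens(patch_embeddings, num_tokens)
        print(num_tokens, compressed.shape)
```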
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.