jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
- URL: http://arxiv.org/abs/2412.08802v1
- Date: Wed, 11 Dec 2024 22:28:12 GMT
- Title: jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
- Authors: Andreas Koukounas, Georgios Mastrapas, Bo Wang, Mohammad Kalim Akram, Sedigheh Eslami, Michael Günther, Isabelle Mohr, Saba Sturua, Scott Martens, Nan Wang, Han Xiao,
- Abstract summary: Contrastive Language-Image Pretraining (CLIP) is a highly effective method for aligning images and texts in a shared embedding space.
CLIP models often struggle with text-only tasks, underperforming compared to specialized text models.
In this work, we build upon our previous model, jina-clip-v1, by introducing a refined framework that utilizes multi-task, multi-stage contrastive learning across multiple languages.
The resulting model, jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks, while adding multilingual support, better understanding of complex visual documents and efficiency gains.
- Score: 5.587329786636647
- License:
- Abstract: Contrastive Language-Image Pretraining (CLIP) is a highly effective method for aligning images and texts in a shared embedding space. These models are widely used for tasks such as cross-modal information retrieval and multi-modal understanding. However, CLIP models often struggle with text-only tasks, underperforming compared to specialized text models. This performance disparity forces retrieval systems to rely on separate models for text-only and multi-modal tasks. In this work, we build upon our previous model, jina-clip-v1, by introducing a refined framework that utilizes multi-task, multi-stage contrastive learning across multiple languages, coupled with an improved training recipe to enhance text-only retrieval. The resulting model, jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks, while adding multilingual support, better understanding of complex visual documents and efficiency gains thanks to Matryoshka Representation Learning and vector truncation. The model performs comparably to the state-of-the-art in both multilingual-multimodal and multilingual text retrieval benchmarks, addressing the challenge of unifying text-only and multi-modal retrieval systems.
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models [43.16111789538798]
We build parallel multilingual prompts aimed at harnessing the multilingual capabilities of large multimodal models (LMMs)
Experiments on two LMMs across 3 benchmarks show that our method, PMT2I achieves, superior performance in general, compositional, and fine-grained assessments.
arXiv Detail & Related papers (2025-01-13T06:41:23Z) - Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval [10.603148564713518]
We present a new embedding model VISTA for universal multi-modal retrieval.
We introduce a flexible architecture which extends a powerful text encoder with the image understanding capability.
Secondly, we develop two data generation strategies, which bring high-quality composed image-text to facilitate the training of the embedding model.
arXiv Detail & Related papers (2024-06-06T17:37:47Z) - Jina CLIP: Your CLIP Model Is Also Your Text Retriever [5.110454439882224]
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors.
We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
arXiv Detail & Related papers (2024-05-30T16:07:54Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - Visual Grounding Strategies for Text-Only Natural Language Processing [1.2183405753834562]
multimodal extensions of BERT allow a joint modeling of texts and images that lead to state-of-the-art results on multimodal tasks such as Visual Question Answering.
Here, we leverage multimodal modeling for purely textual tasks with the expectation that the multimodal pretraining provides a grounding that can improve text processing accuracy.
A first type of strategy, referred to as it transferred grounding consists in applying multimodal models to text-only tasks using a placeholder to replace image input.
The second one, which we call it associative grounding, harnesses image retrieval to match texts with related images during both
arXiv Detail & Related papers (2021-03-25T16:03:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.