MuMUR : Multilingual Multimodal Universal Retrieval
- URL: http://arxiv.org/abs/2208.11553v6
- Date: Mon, 18 Sep 2023 15:33:41 GMT
- Title: MuMUR : Multilingual Multimodal Universal Retrieval
- Authors: Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar
Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
- Abstract summary: We propose MuMUR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
- Score: 19.242056928318913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal retrieval has seen tremendous progress with the development of
vision-language models. However, further improving these models requires
additional labelled data, which takes a huge manual effort. In this paper, we
propose MuMUR, a framework that utilizes knowledge transfer from a multilingual
model to boost the performance of multi-modal (image and video) retrieval. We
first use state-of-the-art machine translation models to construct pseudo
ground-truth multilingual visual-text pairs. We then use this data to learn a
joint vision-text representation where English and non-English text queries are
represented in a common embedding space based on pretrained multilingual
models. We evaluate our proposed approach on a diverse set of retrieval
datasets: five video retrieval datasets (MSRVTT, MSVD, DiDeMo, Charades and
MSRVTT multilingual) and two image retrieval datasets (Flickr30k and
Multi30k). Experimental results demonstrate that our approach achieves
state-of-the-art results on all video retrieval datasets, outperforming previous
models. Additionally, our framework MuMUR significantly outperforms other
multilingual video retrieval models. We also observe that MuMUR exhibits
strong performance on image retrieval. This demonstrates the universal ability
of MuMUR to perform retrieval across all visual inputs (image and video) and
text inputs (monolingual and multilingual).
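The retrieval setting described in the abstract, where English and non-English queries share one embedding space with the visual inputs, can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings below are random toy vectors standing in for the outputs of trained encoders, and the function names (`rank_videos`, `recall_at_k`) are hypothetical. The sketch only shows how cosine-similarity ranking and Recall@K work once queries in any language and videos live in a common space.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def rank_videos(query_emb, video_embs):
    """Return video indices sorted by descending cosine similarity to the query."""
    sims = l2_normalize(video_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)

def recall_at_k(text_embs, video_embs, k):
    """Fraction of queries whose paired video (same row index) is in the top k."""
    hits = 0
    for i, query in enumerate(text_embs):
        top_k = rank_videos(query, video_embs)[:k]
        if i in top_k:
            hits += 1
    return hits / len(text_embs)

# Toy data (assumption): each caption embedding is its video embedding plus
# small noise, mimicking encoders that already align text and video.
rng = np.random.default_rng(0)
video_embs = rng.normal(size=(5, 16))
english_embs = video_embs + 0.01 * rng.normal(size=(5, 16))
# Stand-in for embeddings of machine-translated (e.g. German) captions.
german_embs = video_embs + 0.01 * rng.normal(size=(5, 16))

print("English R@1:", recall_at_k(english_embs, video_embs, k=1))
print("German  R@1:", recall_at_k(german_embs, video_embs, k=1))
```

Because both query sets are scored against the same video embeddings, the same ranking code serves every language. In a real system the alignment would have to come from training rather than from the added noise used here.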
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Exploring Vision Language Models for Multimodal and Multilingual Stance Detection [9.079302402271491]
Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks.
Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios relatively underexplored.
This paper evaluates state-of-the-art Vision-Language Models (VLMs) on multimodal and multilingual stance detection tasks.
arXiv Detail & Related papers (2025-01-29T13:39:53Z)
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images [5.587329786636647]
Contrastive Language-Image Pretraining (CLIP) is a highly effective method for aligning images and texts in a shared embedding space.
CLIP models often struggle with text-only tasks, underperforming compared to specialized text models.
In this work, we build upon our previous model, jina-clip-v1, by introducing a refined framework that utilizes multi-task, multi-stage contrastive learning across multiple languages.
The resulting model, jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks, while adding multilingual support, better understanding of complex visual documents and efficiency gains.
arXiv Detail & Related papers (2024-12-11T22:28:12Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing poor translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval [39.41224716332499]
We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval.
Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages.
We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages.
arXiv Detail & Related papers (2022-10-07T15:30:24Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models [144.85290716246533]
We study zero-shot cross-lingual transfer of vision-language models.
We propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
arXiv Detail & Related papers (2021-03-16T04:37:40Z)