Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text
- URL: http://arxiv.org/abs/2601.10096v2
- Date: Tue, 20 Jan 2026 20:40:18 GMT
- Title: Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text
- Authors: Piyush Singh Pasi
- Abstract summary: Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers--using English text alone--to map multilingual text embeddings into multimodal space.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers--using English text alone--to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, M2M demonstrates robustness across datasets and tasks, extending to Audio-Text retrieval and Text-to-Image generation. We release code and checkpoints (https://github.com/piyushsinghpasi/M2M) along with multilingual evaluation datasets: MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual).
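The mechanism the abstract describes lends itself to a short sketch. Below is a minimal, non-authoritative illustration of an M2M-style alignment head, assuming a frozen multilingual sentence encoder and a frozen CLIP-style multimodal text encoder; the layer count, hidden width, nonlinearity, and cosine-regression loss are all assumptions of this sketch rather than the paper's exact recipe (the released code has the real one), and the names `M2MAlignmentHead` and `alignment_loss` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class M2MAlignmentHead(nn.Module):
    """Hypothetical head: a few linear layers mapping multilingual text
    embeddings (d_in) into a frozen multimodal embedding space (d_out).
    The intermediate nonlinearity is an assumption of this sketch."""
    def __init__(self, d_in: int, d_out: int, d_hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def alignment_loss(head: M2MAlignmentHead,
                   src_emb: torch.Tensor,   # English captions via multilingual encoder
                   tgt_emb: torch.Tensor    # same captions via multimodal text encoder
                   ) -> torch.Tensor:
    """Cosine-distance regression: pull the mapped multilingual embedding of
    each English caption onto its multimodal text embedding. Training needs
    English text only; no images, audio, or non-English data."""
    pred = F.normalize(head(src_emb), dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    return 1.0 - (pred * tgt).sum(dim=-1).mean()
```

Zero-shot transfer then falls out of the multilingual encoder's geometry: translations of a caption sit near its English embedding, so a non-English query encoded multilingually and passed through the head lands close to the matching image or audio embedding without any paired multilingual data.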
Related papers
- uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data [3.364569898365253]
We propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. This minimal training setup enables robust multilingual alignment even for languages with limited supervision.
arXiv Detail & Related papers (2025-11-17T06:34:49Z)
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus [52.83121058429025]
Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. mOSCAR is the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 303M documents, 200B tokens and 1.15B images.
arXiv Detail & Related papers (2024-06-13T00:13:32Z)
- Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT). We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective paradigm for training large multimodal models in non-English languages.
We build VisCPM, large multimodal models for image-to-text and text-to-image generation, which achieve state-of-the-art open-source performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing unreliable translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- Towards Zero-shot Cross-lingual Image Retrieval and Tagging [1.4425878137951236]
We present a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side.
We introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages that we collected using a crowdsourcing platform.
We also demonstrate how a cross-lingual model can be used for downstream tasks like multi-lingual image tagging in a zero-shot manner.
arXiv Detail & Related papers (2021-09-15T23:39:15Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual; see the sketch after this entry.
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
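For the adapter-based approaches xGQA mentions above, a generic bottleneck adapter (Houlsby-style) is one common shape. xGQA's exact design is not specified here, so treat this as an assumed, illustrative module rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small residual bottleneck inserted inside a frozen transformer layer;
    only these few parameters are trained for the new languages."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen model's behaviour recoverable.
        return hidden + self.up(self.act(self.down(hidden)))
```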
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models [144.85290716246533]
We study zero-shot cross-lingual transfer of vision-language models.
We propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
arXiv Detail & Related papers (2021-03-16T04:37:40Z)
- Towards Zero-shot Cross-lingual Image Retrieval [2.5110144299197716]
We present a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side.
We also introduce a new objective function which tightens the text embedding clusters by pushing dissimilar texts away from each other (one possible form is sketched after this list).
We use the multilingual XTD10 caption set as the test set for evaluating zero-shot model performance across languages.
arXiv Detail & Related papers (2020-11-24T22:13:21Z)
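The cluster-tightening objective in the last entry above can be pictured with a standard contrastive loss. This is a sketch only, assuming an InfoNCE-style in-batch formulation; the paper's actual objective may differ, and the function name and temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def cluster_tightening_loss(embeddings: torch.Tensor,
                            positives: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of text embeddings.

    embeddings: (B, d) anchor texts; positives: (B, d) their paired texts
    (e.g., translations of the same caption). In-batch rows at other
    indices act as the dissimilar texts that get pushed away.
    """
    a = F.normalize(embeddings, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)    # diagonal = positive pairs
```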