MURAL: Multimodal, Multitask Retrieval Across Languages
- URL: http://arxiv.org/abs/2109.05125v1
- Date: Fri, 10 Sep 2021 22:26:05 GMT
- Title: MURAL: Multimodal, Multitask Retrieval Across Languages
- Authors: Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta,
Chao Jia, Yinfei Yang, Jason Baldridge
- Abstract summary: MURAL is a dual encoder that solves two tasks: image-text matching and translation pair matching.
By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs.
It considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages.
- Score: 14.323816604663053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Both image-caption pairs and translation pairs provide the means to learn
deep representations of and connections between languages. We use both types of
pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual
encoder that solves two tasks: 1) image-text matching and 2) translation pair
matching. By incorporating billions of translation pairs, MURAL extends ALIGN
(Jia et al. PMLR'21)--a state-of-the-art dual encoder learned from 1.8 billion
noisy image-text pairs. When using the same encoders, MURAL's performance
matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced
languages across several datasets. More importantly, it considerably improves
performance on under-resourced languages, showing that text-text learning can
overcome a paucity of image-caption examples for these languages. On the
Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean
recall by 8.1% on average for eight under-resourced languages and by 6.8% on
average when fine-tuning. We additionally show that MURAL's text
representations cluster not only with respect to genealogical connections but
also based on areal linguistics, such as the Balkan Sprachbund.
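To make the dual-encoder, two-task objective described above concrete, here is a minimal sketch that combines an in-batch contrastive loss for image-text matching with the same loss applied to translation pairs. The random "encoder outputs", temperature, and translation-task weight are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a dual-encoder, two-task objective in the spirit of MURAL:
# an in-batch softmax contrastive loss over (image, caption) pairs plus the
# same loss over (source, target) translation pairs, sharing the text tower.
# The random "encoder outputs", temperature, and task weight below are
# illustrative assumptions, not values from the paper.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def in_batch_contrastive_loss(a, b, temperature=0.05):
    """Symmetric InfoNCE: row i of `a` should match row i of `b` within the batch."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature                      # (batch, batch) similarities
    diag = np.arange(len(a))
    log_softmax_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_softmax_cols = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (log_softmax_rows[diag, diag].mean() + log_softmax_cols[diag, diag].mean())

def mural_style_loss(image_emb, caption_emb, src_text_emb, tgt_text_emb, w_translation=1.0):
    """Weighted sum of the image-text matching and translation-pair matching losses."""
    return (in_batch_contrastive_loss(image_emb, caption_emb)
            + w_translation * in_batch_contrastive_loss(src_text_emb, tgt_text_emb))

# Toy usage with random arrays standing in for encoder outputs.
rng = np.random.default_rng(0)
batch, dim = 8, 64
loss = mural_style_loss(rng.normal(size=(batch, dim)), rng.normal(size=(batch, dim)),
                        rng.normal(size=(batch, dim)), rng.normal(size=(batch, dim)))
print(f"combined loss: {loss:.3f}")
```

Because both tasks feed gradients into the same multilingual text encoder, languages with few captions can still benefit from abundant translation pairs, which is the effect the abstract reports for under-resourced languages.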
Related papers
- UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation [0.09208007322096534]
The aim of SemEval-2024 Task 1 is to develop models for identifying semantic textual relatedness between two sentences.
We develop two STR models, TranSem and FineSem, for the supervised and cross-lingual settings.
arXiv Detail & Related papers (2024-02-20T05:46:29Z) - M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale
Efficient Pretraining [26.262677587795242]
We introduce a comprehensive bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs.
To handle such a scale of dataset, we propose a novel grouped aggregation approach for image-text contrastive loss computation.
We pretrain a series of bilingual image-text foundation models with an enhanced fine-grained understanding ability on BM-6B.
arXiv Detail & Related papers (2024-01-29T05:43:33Z) - Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on image-level tasks that rely on coarse-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z) - Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z) - Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z) - Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations [53.89380284760555]
We introduce Babel-ImageNet, a massively multilingual benchmark that offers partial translations of ImageNet labels to 100 languages.
We evaluate 11 public multilingual CLIP models on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages.
We show that the performance of multilingual CLIP can be drastically improved for low-resource languages with parameter-efficient language-specific training.
arXiv Detail & Related papers (2023-06-14T17:53:06Z) - Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
arXiv Detail & Related papers (2022-01-29T05:48:42Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
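The KL-divergence self-teaching loss mentioned above follows a common pattern: match the model's predictions on translated target-language text to auto-generated soft pseudo-labels. The minimal sketch below shows that generic pattern under assumed shapes; it is not FILTER's actual implementation.

```python
# Generic sketch of a KL-divergence self-teaching loss: the model's predicted
# distribution for translated target-language text is pulled toward
# auto-generated soft pseudo-labels. Shapes and the toy data are assumptions
# for illustration, not details from the FILTER paper.
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_self_teaching_loss(student_logits, soft_pseudo_labels, eps=1e-12):
    """Mean KL(pseudo_labels || softmax(student_logits)) over the batch."""
    p = soft_pseudo_labels                  # (batch, num_classes), rows sum to 1
    q = softmax(student_logits)             # predictions on the translated text
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

# Toy usage: pseudo-labels from one forward pass, logits from another.
rng = np.random.default_rng(1)
pseudo = softmax(rng.normal(size=(4, 3)))
logits = rng.normal(size=(4, 3))
print(f"KL self-teaching loss: {kl_self_teaching_loss(logits, pseudo):.3f}")
```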
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - Practical Comparable Data Collection for Low-Resource Languages via
Images [126.64069379167975]
We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators.
Our method involves using a carefully selected set of images as a pivot between the source and target languages by getting captions for such images in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all.
arXiv Detail & Related papers (2020-04-24T19:30:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.