Multilingual Pixel Representations for Translation and Effective
Cross-lingual Transfer
- URL: http://arxiv.org/abs/2305.14280v2
- Date: Tue, 24 Oct 2023 13:36:49 GMT
- Title: Multilingual Pixel Representations for Translation and Effective
Cross-lingual Transfer
- Authors: Elizabeth Salesky, Neha Verma, Philipp Koehn, Matt Post
- Abstract summary: We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations.
We explore various properties of pixel representations such as parameter sharing within and across scripts to better understand where they lead to positive transfer.
We observe that these properties not only enable seamless cross-lingual transfer to unseen scripts, but make pixel representations more data-efficient than alternatives such as vocabulary expansion.
- Score: 25.575718310334643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce and demonstrate how to effectively train multilingual machine
translation models with pixel representations. We experiment with two different
data settings with a variety of language and script coverage, demonstrating
improved performance compared to subword embeddings. We explore various
properties of pixel representations such as parameter sharing within and across
scripts to better understand where they lead to positive transfer. We observe
that these properties not only enable seamless cross-lingual transfer to unseen
scripts, but make pixel representations more data-efficient than alternatives
such as vocabulary expansion. We hope this work contributes to more extensible
multilingual models for all languages and scripts.
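For intuition only, here is a minimal sketch of the pixel-representation idea: render a sentence into a fixed-height grayscale image and slice it into fixed-width patches that an encoder could embed in place of subword tokens. The font file, image height, and patch width are illustrative assumptions, not the authors' exact configuration.
```python
# Minimal sketch of pixel-based text representations: render text to an
# image, then slice it into fixed-width patches that stand in for subword
# embeddings as encoder inputs. Font path and sizes are illustrative only.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "GoNotoCurrent.ttf"   # assumption: any font covering the target scripts
IMG_HEIGHT = 24                   # pixel height of the rendered line (assumed)
PATCH_WIDTH = 8                   # width of each patch / "visual token" (assumed)

def render_text(text: str, font_size: int = 18) -> np.ndarray:
    """Render a sentence as a grayscale image of shape (IMG_HEIGHT, width)."""
    font = ImageFont.truetype(FONT_PATH, font_size)
    width = int(font.getlength(text)) + PATCH_WIDTH
    img = Image.new("L", (width, IMG_HEIGHT), color=255)
    ImageDraw.Draw(img).text((0, 0), text, fill=0, font=font)
    return np.asarray(img, dtype=np.float32) / 255.0

def to_patches(pixels: np.ndarray) -> np.ndarray:
    """Slice the rendered line into a sequence of flattened patches."""
    h, w = pixels.shape
    pad = (-w) % PATCH_WIDTH                       # pad so width divides evenly
    pixels = np.pad(pixels, ((0, 0), (0, pad)), constant_values=1.0)
    n = pixels.shape[1] // PATCH_WIDTH
    patches = pixels.reshape(h, n, PATCH_WIDTH).transpose(1, 0, 2)
    return patches.reshape(n, h * PATCH_WIDTH)     # (seq_len, patch_dim)

if __name__ == "__main__":
    seq = to_patches(render_text("Привет, мир!"))
    print(seq.shape)  # (num_patches, IMG_HEIGHT * PATCH_WIDTH)
```
Because the input is rendered rather than tokenized, any script the font can draw yields valid inputs, which is what makes transfer to unseen scripts possible without expanding a vocabulary.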
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - Exploring Representational Disparities Between Multilingual and Bilingual Translation Models [16.746335565636976]
Some language pairs in multilingual models can see worse performance than in bilingual models, especially in the one-to-many translation setting.
We show that for a given language pair, its multilingual model decoder representations are consistently less isotropic and occupy fewer dimensions than comparable bilingual model decoder representations; a minimal isotropy sketch appears after this list.
arXiv Detail & Related papers (2023-05-23T16:46:18Z) - Investigating Lexical Sharing in Multilingual Machine Translation for
Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, and Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z) - Multilingual Representation Distillation with Contrastive Learning [20.715534360712425]
We integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences.
We validate our approach with multilingual similarity search and corpus filtering tasks.
arXiv Detail & Related papers (2022-10-10T22:27:04Z) - Language Modelling with Pixels [29.976453396194053]
This paper introduces PIXEL, the Pixel-based Encoder of Language, which avoids the vocabulary bottleneck of subword-based models.
PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages.
We evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts.
arXiv Detail & Related papers (2022-07-14T15:20:36Z) - Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence
Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual language models.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - What makes multilingual BERT multilingual? [60.9051207862378]
In this work, we provide an in-depth experimental study to supplement the existing literature of cross-lingual ability.
We compare the cross-lingual ability of non-contextualized and contextualized representation models trained on the same data.
We find that data size and context window size are crucial factors for transferability.
arXiv Detail & Related papers (2020-10-20T05:41:56Z) - InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language
Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z) - Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR (Scalable Multilingual Aligned Language Representation) is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
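For the isotropy comparison noted in the "Exploring Representational Disparities" entry above, here is a hypothetical sketch (not that paper's exact metrics) of two simple diagnostics one could compute over decoder hidden states: a partition-function isotropy score in the style of Mu and Viswanath (2018), and an effective dimensionality based on how many principal components explain a chosen share of the variance. The 99% variance cut-off and the random example data are assumptions for illustration.
```python
# Hypothetical sketch of quantifying isotropy and effective dimensionality
# of decoder hidden states; not the exact metrics used in the cited paper.
import numpy as np

def isotropy_score(states: np.ndarray) -> float:
    """Partition-function isotropy: ratio of the minimum to the maximum of
    Z(c) = sum_i exp(c . x_i) over principal directions c of the mean-centered
    representations. 1.0 = perfectly isotropic; near 0 = highly anisotropic."""
    x = states - states.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    z = np.exp(x @ vt.T).sum(axis=0)   # one Z value per principal direction
    return float(z.min() / z.max())

def effective_dim(states: np.ndarray, var_threshold: float = 0.99) -> int:
    """Number of principal components needed to explain `var_threshold`
    of the representation variance (illustrative cut-off)."""
    x = states - states.mean(axis=0, keepdims=True)
    s = np.linalg.svd(x, compute_uv=False)
    ratios = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(ratios, var_threshold) + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    multi = rng.normal(size=(1000, 512)) * np.linspace(2.0, 0.01, 512)  # anisotropic toy data
    bi = rng.normal(size=(1000, 512))                                   # roughly isotropic toy data
    print(isotropy_score(multi), effective_dim(multi))
    print(isotropy_score(bi), effective_dim(bi))
```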
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.