MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
- URL: http://arxiv.org/abs/2403.10691v1
- Date: Fri, 15 Mar 2024 21:21:11 GMT
- Title: MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
- Authors: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- Abstract summary: We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and narrows the perplexity gap across diverse languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address these disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than the characters used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and narrows the perplexity gap across diverse languages.
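To make the idea concrete, here is a minimal sketch of a morphology-driven byte encoding. The toy morpheme lexicon, the greedy longest-match segmenter (a stand-in for the unsupervised Morfessor segmentation the paper builds on), and the 0xF8 lead-byte codepage are illustrative assumptions, not the paper's actual construction:

```python
# Sketch: frequent morphemes get short multi-byte codes drawn from byte
# ranges that are invalid as UTF-8 lead bytes, so segment counts stay
# comparable across languages.
MORPHEME_LEXICON = ["play", "ing", "un", "happi", "ness"]  # toy inventory

def segment(word: str, lexicon=MORPHEME_LEXICON) -> list[str]:
    """Greedy longest-match morpheme segmentation (stand-in for Morfessor)."""
    segments, i = [], 0
    while i < len(word):
        match = next((m for m in sorted(lexicon, key=len, reverse=True)
                      if word.startswith(m, i)), None)
        if match:
            segments.append(match)
            i += len(match)
        else:
            segments.append(word[i])  # fall back to a single character
            i += 1
    return segments

# Assign each morpheme a two-byte code behind the 0xF8 lead byte (0xF8 never
# starts a valid UTF-8 sequence, so the codes cannot collide with plain text).
CODES = {m: bytes([0xF8, 0x80 + k]) for k, m in enumerate(MORPHEME_LEXICON)}

def myte_encode(word: str) -> bytes:
    return b"".join(CODES.get(s, s.encode("utf-8")) for s in segment(word))

print(segment("unhappiness"))               # ['un', 'happi', 'ness']
print(len(myte_encode("unhappiness")),      # 6 bytes, versus ...
      len("unhappiness".encode("utf-8")))   # ... 11 plain UTF-8 bytes
```

Because morpheme inventories are more balanced across languages than character inventories, assigning short byte codes at the morpheme level keeps sequence lengths comparable across scripts.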
Related papers
- LangSAMP: Language-Script Aware Multilingual Pretraining [48.16511046793275]
Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings.
LangSAMP incorporates both language and script embeddings to enhance representation learning.
We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages.
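A hedged sketch of what language- and script-aware input composition could look like; the layer names, sizes, additive combination, and the choice to inject the embeddings at the input are assumptions for illustration, not LangSAMP's exact architecture:

```python
import torch
import torch.nn as nn

class LangScriptEmbedding(nn.Module):
    """Token embeddings augmented with language and script embeddings."""
    def __init__(self, vocab_size=30_000, n_langs=500, n_scripts=30, d=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)
        self.lang = nn.Embedding(n_langs, d)      # one vector per language
        self.script = nn.Embedding(n_scripts, d)  # one vector per script

    def forward(self, token_ids, lang_id, script_id):
        # Broadcast the per-example language/script vectors over the sequence.
        return (self.tok(token_ids)
                + self.lang(lang_id)[:, None, :]
                + self.script(script_id)[:, None, :])

emb = LangScriptEmbedding()
x = emb(torch.randint(0, 30_000, (2, 16)),  # token ids: (batch, seq)
        torch.tensor([3, 41]),              # language ids: (batch,)
        torch.tensor([0, 5]))               # script ids: (batch,)
print(x.shape)  # torch.Size([2, 16, 256])
```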
arXiv Detail & Related papers (2024-09-26T18:29:10Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
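UROMAN itself is a Perl tool; as a rough stand-in, the sketch below uses the unidecode package only to illustrate the recipe (transliterate target-language text to Latin script, then adapt the mPLM on the romanized corpus). It is not the paper's pipeline:

```python
from unidecode import unidecode  # pip install unidecode

# Romanize text from several scripts into Latin characters. A real pipeline
# would run UROMAN and then continue pretraining / fine-tuning the mPLM on
# the romanized corpus.
samples = ["Здравей, свят", "नमस्ते दुनिया", "こんにちは世界"]
for s in samples:
    print(s, "->", unidecode(s))
```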
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
arXiv Detail & Related papers (2022-01-29T05:48:42Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
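The clustering step can be sketched as follows; the per-language vectors are simulated with random features, and k-means is one plausible clustering choice rather than necessarily the paper's:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each language is represented by one vector, e.g. the mean of its sentence
# representations from a multilingual pretrained model; random vectors stand
# in for real encoder outputs in this sketch.
rng = np.random.default_rng(0)
langs = ["en", "de", "ru", "hi", "ur", "zh", "ja", "sw"]
lang_reprs = rng.normal(size=(len(langs), 768))

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(lang_reprs)
for lang, c in zip(langs, clusters):
    print(f"{lang} -> representation sprachbund {c}")
```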
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into a language-specific semantic space, and the projected embeddings are then fed into the Transformer model.
Experiments show that XLP significantly boosts model performance on a wide range of multilingual benchmark datasets.
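A minimal sketch of the projection idea, assuming one learned linear map per language applied to the word embeddings before the Transformer; the identity initialization and shapes are illustrative choices, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class XLP(nn.Module):
    """Language-specific projection of word embeddings (sketch)."""
    def __init__(self, n_langs=20, d=256):
        super().__init__()
        # One d x d projection per language, initialized to the identity.
        self.proj = nn.Parameter(torch.eye(d).repeat(n_langs, 1, 1))

    def forward(self, word_emb, lang_id):
        # word_emb: (batch, seq, d); select each example's language projection.
        return torch.einsum("bsd,bde->bse", word_emb, self.proj[lang_id])

xlp = XLP()
out = xlp(torch.randn(2, 16, 256), torch.tensor([7, 13]))
print(out.shape)  # torch.Size([2, 16, 256])
```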
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to the models' limited capacity and large differences in the amount of pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
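One simple adaptation recipe in the spirit of this line of work is sketched below: add tokens for the unseen script, resize the embedding matrix, and train only the embeddings on target-language text. This is an illustrative baseline, not the paper's proposed method, and the Tifinagh example tokens are arbitrary:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Add tokens for a script unseen in pretraining (Tifinagh here) and grow the
# embedding matrix accordingly; the new rows are randomly initialized.
new_tokens = ["ⴰⵣⵓⵍ", "ⵜⴰⵏⵎⵎⵉⵔⵜ"]
tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))

# Data-efficient adaptation: freeze everything except the input embeddings,
# then train on target-language text with the usual MLM objective.
for p in model.parameters():
    p.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
```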
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
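A single-mask cloze probe in this spirit can be run with a stock multilingual masked LM, as sketched below; the prompts are illustrative, not benchmark items, and the actual benchmark also handles multi-token answers, which this simplification ignores:

```python
from transformers import pipeline

# Fill-mask with multilingual BERT: ask the same factual question in two
# languages and compare the model's top completion.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
for prompt in ["Paris is the capital of [MASK].",
               "Paris ist die Hauptstadt von [MASK]."]:
    best = fill(prompt)[0]  # highest-scoring completion
    print(prompt, "->", best["token_str"], f"({best['score']:.2f})")
```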
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
- What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties [17.404220737977738]
We propose methods for probing sentence representations from state-of-the-art multilingual encoders.
Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies.
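A typical probing setup along these lines is a lightweight classifier trained to read a typological property off frozen sentence representations; in the sketch below the embeddings and labels are randomly generated placeholders, so the probe should score near chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 768))    # stand-in for frozen sentence embeddings
y = rng.integers(0, 3, size=600)   # toy labels, e.g. 0=SVO, 1=SOV, 2=VSO

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # near chance on random data
```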
arXiv Detail & Related papers (2020-09-27T15:00:52Z)
- Language-agnostic Multilingual Modeling [23.06484126933893]
We build a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer.
We show with four Indic languages, namely Hindi, Bengali, Tamil, and Kannada, that the language-agnostic multilingual model achieves up to a 10% relative reduction in Word Error Rate (WER) over a language-dependent multilingual model.
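Note that the 10% figure is a relative reduction; a quick check of the arithmetic, using an assumed baseline WER rather than a number from the paper:

```python
# A 10% *relative* WER reduction scales the baseline error rate by 0.9:
# e.g. a baseline WER of 20.0% improves to 18.0%, not to 10.0%.
baseline_wer, improved_wer = 20.0, 18.0
relative_reduction = (baseline_wer - improved_wer) / baseline_wer
print(f"{relative_reduction:.0%} relative WER reduction")  # 10%
```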
arXiv Detail & Related papers (2020-04-20T18:57:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.