Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
- URL: http://arxiv.org/abs/2407.14878v1
- Date: Sat, 20 Jul 2024 13:56:39 GMT
- Title: Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
- Authors: Yongxin Huang, Kexin Wang, Goran Glavaš, Iryna Gurevych
- Abstract summary: Training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of semantic spaces of individual languages.
We train language-specific sentence encoders to avoid negative interference between languages.
We then align all non-English monolingual encoders to the English encoder by training a cross-lingual alignment adapter on top of each.
- Score: 50.80949663719335
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multilingual sentence encoders are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. As such, they are subject to the curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. Another limitation of multilingual sentence encoders is the trade-off between monolingual and cross-lingual performance. Training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of the semantic spaces of individual languages, harming the utility of sentence embeddings in monolingual tasks. In this work, we address both issues by modular training of sentence encoders, i.e., by separating monolingual specialization from cross-lingual alignment. We first efficiently train language-specific sentence encoders to avoid negative interference between languages (i.e., the curse). We then align all non-English monolingual encoders to the English encoder by training a cross-lingual alignment adapter on top of each, preventing interference with the monolingual specialization from the first step. In both steps, we resort to contrastive learning on machine-translated paraphrase data. Monolingual and cross-lingual evaluations on semantic text similarity/relatedness and multiple-choice QA render our modular solution more effective than multilingual sentence encoders, especially benefiting low-resource languages.
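The alignment step described above can be illustrated with a minimal sketch: a linear adapter maps non-English sentence embeddings toward a frozen English embedding space, and an InfoNCE-style contrastive loss pulls each translated pair together while pushing apart non-pairs. This is not the authors' implementation; the function names, the linear adapter form, and the temperature value are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def adapter(vec, W):
    # hypothetical linear alignment adapter: out_i = sum_j W[i][j] * vec[j]
    return [dot(row, vec) for row in W]

def info_nce_loss(src, tgt, temperature=0.05):
    """InfoNCE contrastive loss: src[i] (adapted non-English embedding)
    should match tgt[i] (frozen English embedding) against all tgt[j]."""
    loss = 0.0
    for i, s in enumerate(src):
        logits = [cosine(s, t) / temperature for t in tgt]
        m = max(logits)  # log-sum-exp stabilization
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)
    return loss / len(src)

# Toy usage: identity adapter on already-aligned 2-d embeddings.
english = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = [adapter(v, identity) for v in english]
loss = info_nce_loss(aligned, english)  # small: pairs already match
```

In the modular setup, only the adapter weights `W` would receive gradients, which is what keeps the alignment step from disturbing the monolingual encoder trained in the first step.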
Related papers
- Code-mixed Sentiment and Hate-speech Prediction [2.9140539998069803]
Large language models have dominated most natural language processing tasks.
We created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages.
We performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages.
arXiv Detail & Related papers (2024-05-21T16:56:36Z)
- Fine-Tuned Self-Supervised Speech Representations for Language Diarization in Multilingual Code-Switched Speech [4.39549503760707]
We develop a continuous multilingual language diarizer using fine-tuned speech representations extracted from a large self-supervised architecture (WavLM)
We experiment with a code-switched corpus consisting of five South African languages (isiZulu, isiXhosa, Setswana, Sesotho, and English).
arXiv Detail & Related papers (2023-12-15T09:40:41Z)
- Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment [42.624862172666624]
We propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences.
It aligns the internal sentence representations across different languages via multilingual contrastive learning.
Experimental results show that even with less than 0.1‰ (per-thousand) of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models.
arXiv Detail & Related papers (2023-11-14T11:24:08Z)
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages [12.00637655338665]
We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model.
For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
arXiv Detail & Related papers (2022-05-25T10:53:24Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Because gold labels are unavailable for the translated text, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
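The self-teaching loss described above can be sketched as a KL divergence between soft pseudo-label distributions and the model's softened predictions. This is a generic illustration of that loss shape, not FILTER's implementation; the function names and the single-example setup are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_teaching_loss(teacher_probs, student_logits):
    """KL between soft pseudo-labels (teacher_probs) and the model's
    softmax distribution over its logits for the translated text."""
    m = max(student_logits)  # log-sum-exp stabilization
    exps = [math.exp(l - m) for l in student_logits]
    z = sum(exps)
    student_probs = [e / z for e in exps]
    return kl_divergence(teacher_probs, student_probs)

# Toy usage: pseudo-labels favoring class 0, student logits mildly agreeing.
loss = self_teaching_loss([0.7, 0.2, 0.1], [2.0, 0.5, 0.0])
```

Minimizing this loss nudges the student's predictions on the unlabeled translated text toward the auto-generated soft targets.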
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.