Analysing The Impact Of Linguistic Features On Cross-Lingual Transfer
- URL: http://arxiv.org/abs/2105.05975v1
- Date: Wed, 12 May 2021 21:22:58 GMT
- Title: Analysing The Impact Of Linguistic Features On Cross-Lingual Transfer
- Authors: Błażej Dolicki and Gerasimos Spanakis
- Abstract summary: We analyze a state-of-the-art multilingual model and try to determine what impacts good transfer between languages.
We show that looking at particular syntactic features is 2-4 times more helpful in predicting the performance than an aggregated syntactic similarity.
- Score: 3.299672391663527
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is an increasing amount of evidence that in cases with little or no
data in a target language, training on a different language can yield surprisingly
good results. However, there are currently no established guidelines for choosing the
training (source) language. In an attempt to solve this issue, we thoroughly analyze a
state-of-the-art multilingual model and try to determine what impacts good transfer
between languages. In contrast to the majority of multilingual NLP literature, we do
not train only on English but on a group of almost 30 languages. We show that looking
at particular syntactic features is 2-4 times more helpful in predicting performance
than an aggregated syntactic similarity. We find that the importance of syntactic
features differs strongly depending on the downstream task: no single feature is a
good performance predictor for all NLP tasks. As a result, one should not expect that
for a target language $L_1$ there is a single source language $L_2$ that is the best
choice for every NLP task (for instance, for Bulgarian, the best source language is
French for POS tagging, Russian for NER, and Thai for NLI). We discuss the most
important linguistic features affecting transfer quality using statistical and machine
learning methods.
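To make the comparison between individual syntactic features and an aggregated similarity concrete, here is a minimal sketch (not the authors' code) of predicting transfer scores from per-pair syntactic features. The feature matrix, transfer scores, and model choices are placeholders: real inputs would be syntactic feature values for each (source, target) pair (e.g. from URIEL/lang2vec) and measured downstream task scores.

```python
# Minimal sketch, assuming placeholder data: compare how well individual
# syntactic features vs. a single aggregated similarity predict
# cross-lingual transfer scores.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_pairs, n_feats = 300, 100                                   # hypothetical (source, target) pairs
X_syntax = rng.integers(0, 2, size=(n_pairs, n_feats)).astype(float)  # per-feature agreement (placeholder)
transfer_score = rng.random(n_pairs)                          # placeholder for measured task scores

# Predictor 1: individual syntactic features.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
r2_features = cross_val_score(rf, X_syntax, transfer_score, cv=5, scoring="r2")

# Predictor 2: one aggregated similarity (mean feature agreement per pair).
X_agg = X_syntax.mean(axis=1, keepdims=True)
r2_aggregate = cross_val_score(LinearRegression(), X_agg, transfer_score, cv=5, scoring="r2")

print(f"individual features  R^2: {r2_features.mean():.3f}")
print(f"aggregated similarity R^2: {r2_aggregate.mean():.3f}")

# With real transfer scores, per-feature importances
# (rf.fit(X_syntax, transfer_score).feature_importances_) would indicate
# which syntactic properties matter most for a given task.
```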
Related papers
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role for language grouping.
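As a rough illustration of the gradient-similarity idea summarized above (my own sketch under assumed inputs, not the GradSim implementation), languages can be grouped by clustering their per-language loss-gradient vectors:

```python
# Sketch: group languages by the similarity of their task-loss gradients.
# The per-language gradient vectors are assumed to come from backpropagating
# each language's loss through a shared model; random placeholders are used here.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

langs = ["en", "de", "tr", "sw", "fi"]         # hypothetical language set
rng = np.random.default_rng(1)
grads = rng.normal(size=(len(langs), 4096))    # one flattened gradient per language

sim = cosine_similarity(grads)                 # pairwise gradient similarity
dist = 1.0 - sim                               # turn similarity into a distance
np.fill_diagonal(dist, 0.0)

# Average-linkage hierarchical clustering on the condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")
for lang, g in zip(langs, groups):
    print(f"{lang} -> group {g}")
```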
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
- A Commonsense-Infused Language-Agnostic Learning Framework for Enhancing Prediction of Political Polarity in Multilingual News Headlines [0.0]
We use the method of translation and retrieval to acquire the inferential knowledge in the target language.
We then employ an attention mechanism to emphasise important inferences.
We present a dataset of over 62.6K multilingual news headlines in five European languages annotated with their respective political polarities.
arXiv Detail & Related papers (2022-12-01T06:07:01Z)
- Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models [4.731313022026271]
In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT).
Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitudes more data but also is highly efficient in the transfer process.
arXiv Detail & Related papers (2022-09-14T05:20:52Z)
- Unravelling Interlanguage Facts via Explainable Machine Learning [10.71581852108984]
We focus on the internals of an NLI classifier trained by an explainable machine learning algorithm.
We use this perspective in order to tackle both NLI and a companion task, guessing whether a text has been written by a native or a non-native speaker.
We investigate which kinds of linguistic traits are most effective for solving our two tasks, i.e., which are most indicative of a speaker's L1.
arXiv Detail & Related papers (2022-08-02T14:05:15Z)
- How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected.
arXiv Detail & Related papers (2022-04-29T17:50:36Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Zero-Shot Cross-Lingual Transfer with Meta Learning [45.29398184889296]
We consider the setting of training models on multiple languages at the same time, when little or no data is available for languages other than English.
We show that this challenging setup can be approached using meta-learning.
We experiment using standard supervised, zero-shot cross-lingual, as well as few-shot cross-lingual settings for different natural language understanding tasks.
arXiv Detail & Related papers (2020-03-05T16:07:32Z)