An Empirical Study of Factors Affecting Language-Independent Models
- URL: http://arxiv.org/abs/1912.13106v1
- Date: Mon, 30 Dec 2019 22:41:57 GMT
- Title: An Empirical Study of Factors Affecting Language-Independent Models
- Authors: Xiaotong Liu, Yingbei Tong, Anbang Xu, Rama Akkiraju
- Abstract summary: We show that language-independent models can be comparable to or even outperform models trained on monolingual data.
We experiment with language-independent models across many different languages and show that they are better suited to typologically similar languages.
- Score: 11.976665726887733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling existing applications and solutions to multiple human languages has
traditionally proven to be difficult, mainly due to the language-dependent
nature of preprocessing and feature engineering techniques employed in
traditional approaches. In this work, we empirically investigate the factors
affecting language-independent models built with multilingual representations,
including task type, language set, and data resource. On two of the most
representative NLP tasks -- sentence classification and sequence labeling --
we show that language-independent models can be comparable to or even
outperform models trained on monolingual data, and that they are generally
more effective on sentence classification. We experiment with
language-independent models across many different languages and show that they
are better suited to typologically similar languages. We also explore the
effects of different data sizes when
training and testing language-independent models, and demonstrate that they are
not only suitable for high-resource languages, but also very effective in
low-resource languages.
Related papers
- From Isolates to Families: Using Neural Networks for Automated Language Affiliation [9.182884165239996]
In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow.
Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation.
We present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families.
arXiv Detail & Related papers (2025-02-17T11:25:32Z) - Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models [1.5703073293718952]
Token similarity and country similarity emerge as pivotal factors, alongside pre-training data and model size, in enhancing model performance.
These insights offer valuable guidance for developing more equitable and effective multilingual language models.
arXiv Detail & Related papers (2024-12-17T03:05:26Z) - Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques.
Our results offer practical suggestions, for example, calibrating in the target language can efficiently retain the language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z) - The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.