IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
- URL: http://arxiv.org/abs/2410.02611v1
- Date: Thu, 3 Oct 2024 15:50:08 GMT
- Title: IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
- Authors: Akhilesh Aravapalli, Mounika Marreddy, Subba Reddy Oota, Radhika Mamidi, Manish Gupta
- Abstract summary: Transformer-based models have revolutionized the field of natural language processing.
How robust are these models in encoding linguistic properties when faced with perturbations in the input text?
In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages.
- Score: 14.77467551053299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://tinyurl.com/IndicSentEval].
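As a rough illustration of the probing-under-perturbation setup the abstract describes, the sketch below trains a linear probe on frozen multilingual BERT embeddings for a toy surface property and re-scores it on perturbed input. The model choice, property, labels, and perturbation are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch of a probing-under-perturbation experiment. Frozen mBERT,
# a toy surface property, and a toy perturbation are illustrative
# assumptions, not the paper's exact protocol.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentences):
    """Mean-pooled last-layer embeddings for a list of sentences."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state         # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # zero out padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def drop_every_other_word(sentence):
    """Toy stand-in for perturbations like 'drop verbs'."""
    return " ".join(sentence.split()[::2])

# Toy probing task: is the sentence 'long' (> 6 words)?
sents = ["I saw a bird .",
         "The quick brown fox jumps over the lazy dog today ."] * 50
labels = [int(len(s.split()) > 6) for s in sents]

probe = LogisticRegression(max_iter=1000).fit(embed(sents), labels)
clean_acc = probe.score(embed(sents), labels)
pert_acc = probe.score(embed([drop_every_other_word(s) for s in sents]), labels)
print(f"clean={clean_acc:.2f}  perturbed={pert_acc:.2f}")  # robustness gap
```

A drop in accuracy between the clean and perturbed scores is the kind of robustness signal the paper measures per property, per perturbation, and per language.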
Related papers
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet it is not conclusive whether language imbalance causes cross-lingual generalisation in that setting.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- Language Model Tokenizers Introduce Unfairness Between Languages [98.92630681729518]
We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked.
Even character-level and byte-level models exhibit encoding-length differences of more than 4x for some language pairs.
We make the case that we should train future language models using multilingually fair subword tokenizers.
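The encoding-length disparity this entry describes is straightforward to measure directly. A minimal sketch, assuming a HuggingFace tokenizer and an illustrative parallel sentence pair (not the paper's corpus or metric):

```python
# Sketch: measuring tokenizer length disparity across languages.
# The model and sentence pair are illustrative placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

parallel = {
    "en": "The weather is nice today.",
    "hi": "आज मौसम अच्छा है।",  # Hindi translation of the same sentence
}

lengths = {lang: len(tokenizer.tokenize(text)) for lang, text in parallel.items()}
print(lengths)
# A ratio far from 1.0 means one language pays more tokens
# (and thus more compute and context budget) for the same content.
print(f"hi/en token ratio: {lengths['hi'] / lengths['en']:.2f}")
```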
arXiv Detail & Related papers (2023-05-17T14:17:57Z)
- Language Embeddings Sometimes Contain Typological Generalizations [0.0]
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages.
The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features.
We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Handling Compounding in Mobile Keyboard Input [7.309321705635677]
This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages.
Smartphone keyboards typically support features such as input decoding, corrections and predictions that all rely on language models.
We show that this method brings around 20% word error rate reduction in a variety of compounding languages.
arXiv Detail & Related papers (2022-01-17T15:28:58Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
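A minimal sketch of the clustering step this entry describes, assuming the language-level vectors are already computed; random placeholders stand in for embeddings from a multilingual pre-trained model, and the cluster count is arbitrary:

```python
# Sketch: cluster language-level representations into groups
# ("representation sprachbunds"). The vectors are random placeholders;
# in practice they would come from a multilingual pre-trained model.
import numpy as np
from sklearn.cluster import KMeans

languages = ["hi", "bn", "ta", "te", "mr", "ur", "en", "de"]
rng = np.random.default_rng(0)
lang_vectors = rng.normal(size=(len(languages), 768))  # placeholder embeddings

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(lang_vectors)
for lang, cluster in zip(languages, kmeans.labels_):
    print(lang, "->", cluster)
```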
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages [0.0]
Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology.
We introduce Vyākarana: a benchmark of gender-balanced Colorless Green sentences in Indic languages for syntactic evaluation of multilingual language models.
We use the datasets from the evaluation tasks to probe five multilingual language models of varying architectures for syntax in Indic languages.
arXiv Detail & Related papers (2021-03-01T09:07:58Z)
- RuSentEval: Linguistic Source, Encoder Force! [1.8160945635344525]
We introduce RuSentEval, an enhanced set of 14 probing tasks for Russian.
We apply a combination of complementary probing methods to explore the distribution of various linguistic properties in five multilingual transformers.
Our results provide intriguing findings that contradict the common understanding of how linguistic knowledge is represented.
arXiv Detail & Related papers (2021-02-28T17:43:42Z)
- Neural Polysynthetic Language Modelling [15.257624461339867]
In high-resource languages, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types.
This assumes that there are limited inflections per root and that the majority will appear in a large enough corpus.
We examine the current state-of-the-art in language modelling, machine translation, and text prediction for four polysynthetic languages.
arXiv Detail & Related papers (2020-05-11T22:57:04Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
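For readers unfamiliar with the technique named here, below is a minimal SVCCA sketch in NumPy: SVD-reduce each representation matrix, then compute canonical correlations between the reduced spaces. This is a simplified reading of SVCCA under stated assumptions, not the paper's implementation.

```python
# Minimal SVCCA sketch (singular vector canonical correlation analysis):
# SVD-reduce each representation, then compute canonical correlations
# between the reduced spaces. Simplified; not the paper's implementation.
import numpy as np

def svcca_similarity(X, Y, var_kept=0.99):
    """X, Y: (n_samples, dim) representations aligned row-by-row."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)

    def svd_reduce(A):
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        keep = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept) + 1
        return U[:, :keep] * s[:keep]  # top singular directions

    Xr, Yr = svd_reduce(X), svd_reduce(Y)

    # CCA via QR + SVD: singular values of Qx^T Qy are the correlations.
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return corrs.mean()

# Toy usage: random matrices stand in for two models' embeddings.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 64))
B = A @ rng.normal(size=(64, 64)) + 0.1 * rng.normal(size=(500, 64))
print(f"SVCCA similarity: {svcca_similarity(A, B):.3f}")
```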
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.