Evaluating Transformer-Based Multilingual Text Classification
- URL: http://arxiv.org/abs/2004.13939v2
- Date: Thu, 30 Apr 2020 20:31:38 GMT
- Title: Evaluating Transformer-Based Multilingual Text Classification
- Authors: Sophie Groenwold, Samhita Honnavalli, Lily Ou, Aesha Parekh, Sharon
Levy, Diba Mirza, William Yang Wang
- Abstract summary: We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
- Score: 55.53547556060537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As NLP tools become ubiquitous in today's technological landscape, they are
increasingly applied to languages with a variety of typological structures.
However, NLP research does not focus primarily on typological differences in
its analysis of state-of-the-art language models. As a result, NLP tools
perform unequally across languages with different syntactic and morphological
structures. Through a detailed discussion of word order typology, morphological
typology, and comparative linguistics, we identify which variables most affect
language modeling efficacy; in addition, we calculate word order and
morphological similarity indices to aid our empirical study. We then use this
background to support our analysis of an experiment we conduct using
multi-class text classification on eight languages and eight models.
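The abstract mentions word order and morphological similarity indices without defining them. As a minimal sketch, a word-order similarity index can be computed as the fraction of typological features on which two languages agree; the feature names and values below are illustrative stand-ins (WALS-style), not taken from the paper:

```python
# Hypothetical word-order feature vectors (WALS-style categorical values).
# The feature values here are illustrative, not the paper's data.
FEATURES = {
    "English":  {"basic_order": "SVO", "adposition": "preposition",  "adjective_noun": "AdjN"},
    "Japanese": {"basic_order": "SOV", "adposition": "postposition", "adjective_noun": "AdjN"},
    "Turkish":  {"basic_order": "SOV", "adposition": "postposition", "adjective_noun": "AdjN"},
}

def word_order_similarity(lang_a: str, lang_b: str) -> float:
    """Fraction of shared word-order features on which two languages agree."""
    a, b = FEATURES[lang_a], FEATURES[lang_b]
    shared = [f for f in a if f in b]
    matches = sum(a[f] == b[f] for f in shared)
    return matches / len(shared)

print(word_order_similarity("Japanese", "Turkish"))   # 1.0: identical profiles
print(word_order_similarity("English", "Japanese"))   # 1/3: only AdjN matches
```

A real index would draw feature values from a typological database rather than a hand-written table, but the arithmetic is the same.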
Related papers
- Analyzing The Language of Visual Tokens [48.62180485759458]
We take a natural-language-centric approach to analyzing discrete visual languages.
We show that higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts.
We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages.
arXiv Detail & Related papers (2024-11-07T18:59:28Z)
- Morphological Typology in BPE Subword Productivity and Language Modeling [0.0]
We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized.
Experiments reveal that languages with synthetic features exhibit greater subword regularity and productivity with BPE tokenization.
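Subword regularity under BPE can be made concrete with a toy byte-pair-encoding learner. This is a simplified sketch of the standard BPE algorithm, run on an invented agglutinative-style corpus (a stem with stacked suffixes), not the paper's data or code:

```python
from collections import Counter

def bpe_merges(words: dict, num_merges: int):
    """Learn BPE merges from a word-frequency dict; return (merges, vocab)."""
    vocab = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Toy agglutinative-style corpus: stem "ev" plus stacked suffixes.
corpus = {"evler": 4, "evlerde": 3, "evde": 2, "ev": 5}
merges, vocab = bpe_merges(corpus, 4)
print(merges[0])  # ('e', 'v'): the stem is merged first, as it recurs everywhere
```

Because the stem and suffixes recur across word forms, the most frequent pair is the stem itself; this regularity is what "subword productivity" refers to.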
arXiv Detail & Related papers (2024-10-31T06:13:29Z)
- A Joint Matrix Factorization Analysis of Multilingual Representations [28.751144371901958]
We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models.
We study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained models.
arXiv Detail & Related papers (2023-10-24T04:43:45Z)
- Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Work in Natural Language Processing (NLP) typically labels a whole language with a strict morphological type, e.g. fusional or agglutinative.
In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level.
For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German, and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study.
Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at the word level (nouns and verbs for English-Turkish).
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
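Bitext retrieval as an alignment measure can be sketched in a few lines: for each source sentence embedding, retrieve the nearest target embedding by cosine similarity and check whether it is the gold-aligned one. The embeddings below are synthetic stand-ins, not model outputs:

```python
import numpy as np

def bitext_retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source rows whose cosine-nearest target row
    is the gold-aligned one at the same index."""
    s = src / np.linalg.norm(src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (s @ t.T).argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())

# Synthetic "embeddings": targets are near-copies of sources, so a
# well-aligned space should retrieve every pair correctly.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
tgt = src + 0.01 * rng.normal(size=(4, 8))
print(bitext_retrieval_accuracy(src, tgt))        # 1.0: perfectly aligned
print(bitext_retrieval_accuracy(src, tgt[::-1]))  # 0.0: rows deliberately permuted
```

The actual papers use sentence embeddings from cross-lingual models; the retrieval arithmetic is the same.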
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- On the Transferability of Neural Models of Morphological Analogies [7.89271130004391]
In this paper, we focus on morphological tasks and we propose a deep learning approach to detect morphological analogies.
We present an empirical study of how our framework transfers across languages, highlighting interesting similarities and differences between these languages.
In view of these results, we also discuss the possibility of building a multilingual morphological model.
arXiv Detail & Related papers (2021-08-09T11:08:33Z)
- Morphology Matters: A Multilingual Language Modeling Analysis [8.791030561752384]
Prior studies disagree on whether inflectional morphology makes languages harder to model.
We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.
Several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data.
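Surprisal, the quantity being correlated with morphological measures above, is simply the negative log-probability a language model assigns to each token. The paper uses LSTMs over BPE segments; the sketch below substitutes a Laplace-smoothed unigram model purely to make the quantity concrete:

```python
import math
from collections import Counter

def unigram_surprisal(train_tokens, test_tokens):
    """Average surprisal (bits per token) of test tokens under an
    add-one-smoothed unigram model estimated from train tokens."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = sum(counts.values()) + len(vocab)  # add-one smoothing mass
    bits = [-math.log2((counts[t] + 1) / total) for t in test_tokens]
    return sum(bits) / len(bits)

train = "the cat sat on the mat".split()
test = "the cat".split()
print(round(unigram_surprisal(train, test), 3))  # higher bits = harder to predict
```

A language whose morphology fragments tokens into rarer forms raises these per-token probabilities' denominators and hence average surprisal, which is the effect the study measures.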
arXiv Detail & Related papers (2020-12-11T11:55:55Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Comparison of Turkish Word Representations Trained on Different Morphological Forms [0.0]
This study prepares texts in morphologically different forms of Turkish, a morphologically rich language.
We trained word2vec models on texts in which lemmas and suffixes are treated differently.
We also trained the subword-based fastText model and compared the embeddings on word analogy, text classification, sentiment analysis, and language modeling tasks.
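The word-analogy evaluation mentioned above is the standard vector-arithmetic test: the answer to "a is to b as c is to ?" is the word closest to b - a + c, excluding the query words. A minimal sketch with a tiny hand-built embedding space (the vectors are invented for illustration, not trained):

```python
import numpy as np

def analogy(emb: dict, a: str, b: str, c: str) -> str:
    """Return the word whose vector is cosine-closest to b - a + c,
    excluding the three query words (the standard analogy evaluation)."""
    query = emb[b] - emb[a] + emb[c]
    query = query / np.linalg.norm(query)
    best, best_sim = None, -1.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(v @ query / np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Invented vectors encoding a consistent "plural" offset, mirroring the
# Turkish pairs ev/evler ("house"/"houses") and araba/arabalar ("car"/"cars").
emb = {
    "ev":       np.array([1.0, 0.0, 0.0]),
    "evler":    np.array([1.0, 1.0, 0.0]),
    "araba":    np.array([0.0, 0.0, 1.0]),
    "arabalar": np.array([0.0, 1.0, 1.0]),
}
print(analogy(emb, "ev", "evler", "araba"))  # "arabalar"
```

Trained word2vec or fastText embeddings would replace the hand-built table; the evaluation arithmetic is unchanged.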
arXiv Detail & Related papers (2020-02-13T10:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.