What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties
- URL: http://arxiv.org/abs/2009.12862v1
- Date: Sun, 27 Sep 2020 15:00:52 GMT
- Title: What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties
- Authors: Rochelle Choenni, Ekaterina Shutova
- Abstract summary: We propose methods for probing sentence representations from state-of-the-art multilingual encoders.
Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies.
- Score: 17.404220737977738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual sentence encoders have seen much success in cross-lingual model
transfer for downstream NLP tasks. Yet, we know relatively little about the
properties of individual languages or the general patterns of linguistic
variation that they encode. We propose methods for probing sentence
representations from state-of-the-art multilingual encoders (LASER, M-BERT, XLM
and XLM-R) with respect to a range of typological properties pertaining to
lexical, morphological and syntactic structure. In addition, we investigate how
this information is distributed across all layers of the models. Our results
show interesting differences in encoding linguistic variation associated with
different pretraining strategies.
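The probing setup can be illustrated with a minimal sketch: extract fixed sentence embeddings from a pretrained multilingual encoder and train a lightweight classifier to predict a typological property of the input language. The model checkpoint, layer index, pooling choice, and toy word-order labels below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal probing sketch (assumptions: mean-pooled M-BERT layer states,
# a toy WALS-style word-order label per language; not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def embed(sentence: str, layer: int = 8) -> torch.Tensor:
    """Mean-pool token states from one encoder layer as the sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Toy data: sentences labelled with a typological class of their language,
# e.g. dominant word order (SVO vs SOV).
sentences = ["The cat chased the mouse.", "Neko ga nezumi o oikaketa."]
labels = ["SVO", "SOV"]

X = torch.stack([embed(s) for s in sentences]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))
```

Repeating the probe per layer, as the paper does across all layers, shows where in the network the typological signal concentrates.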
Related papers
- Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5 [4.779196219827507]
We capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5.
Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others.
arXiv Detail & Related papers (2024-10-15T14:14:19Z)
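A quick way to see the contrast this paper studies is to segment the same word with both tokenizers; a small sketch, with the public small checkpoints standing in for the models actually probed:

```python
# Contrast subword (mT5) vs byte-level (ByT5) segmentation of one word.
# Checkpoint names are the public small variants, used here for illustration.
from transformers import AutoTokenizer

mt5_tok = AutoTokenizer.from_pretrained("google/mt5-small")
byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")

word = "unhappiness"
print(mt5_tok.tokenize(word))   # a handful of SentencePiece subwords
print(byt5_tok.tokenize(word))  # one token per UTF-8 byte
```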
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and narrows the perplexity gap across diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
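The inequality MYTE targets can be seen by measuring raw UTF-8 encoding lengths of comparable sentences; the sample sentences below are ad-hoc stand-ins for a real parallel corpus, and this only illustrates the problem, not MYTE's encoding itself:

```python
# UTF-8 byte counts for roughly parallel sentences: scripts outside the
# Latin range cost far more bytes, which MYTE-style encodings equalize.
samples = {
    "English": "I am reading a book.",
    "Russian": "Я читаю книгу.",
    "Hindi":   "मैं एक किताब पढ़ रहा हूँ।",
}
for lang, sent in samples.items():
    print(f"{lang}: {len(sent)} chars -> {len(sent.encode('utf-8'))} bytes")
```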
- Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We further develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
arXiv Detail & Related papers (2022-05-07T13:44:28Z)
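A generic sketch of the idea: perturb an utterance by code-switching dictionary words to build a hard negative, then score pairs with a contrastive objective. The toy bilingual dictionary, the stand-in encoder outputs, and the single-level InfoNCE-style loss are all illustrative simplifications of the paper's multi-level scheme.

```python
import torch
import torch.nn.functional as F

# Toy bilingual dictionary used to code-switch words (hypothetical data).
dictionary = {"flight": "vuelo", "book": "reservar"}

def code_switch(utterance: str) -> str:
    """Swap dictionary words into another language to make a hard negative."""
    return " ".join(dictionary.get(w, w) for w in utterance.split())

anchor = "book a flight to boston"
negative = code_switch(anchor)  # "reservar a vuelo to boston"

# Stand-in encoder outputs; in practice these come from a sentence encoder.
z_anchor, z_pos, z_neg = torch.randn(3, 128)

# InfoNCE-style loss: pull the positive close, push the code-switched negative away.
logits = torch.stack([F.cosine_similarity(z_anchor, z_pos, dim=0),
                      F.cosine_similarity(z_anchor, z_neg, dim=0)]) / 0.07
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
print(negative, loss.item())
```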
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group a representation sprachbund.
Experiments on cross-lingual benchmarks show significant improvements over strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
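The clustering step is straightforward to sketch: given one vector per language from a pretrained model, group the vectors with k-means, and let each cluster play the role of a representation sprachbund. The language vectors below are random placeholders, not representations from an actual model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder language vectors; in the paper these come from a
# multilingual pre-trained model, one representation per language.
rng = np.random.default_rng(0)
languages = ["en", "de", "nl", "hi", "ur", "ta"]
vectors = rng.normal(size=(len(languages), 64))

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for lang, g in zip(languages, groups):
    print(lang, "-> sprachbund", g)  # cluster id = representation sprachbund
```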
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace the language embedding.
XLP projects word embeddings into a language-specific semantic space; the projected embeddings are then fed into the Transformer model.
Experiments show that XLP significantly boosts model performance on a wide range of multilingual benchmarks.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
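A minimal reading of XLP: instead of adding a language embedding vector, apply a language-specific linear map to the word embeddings before the Transformer. The module below is a sketch under that reading; the dimensions, parameterization, and language indexing are illustrative assumptions.

```python
import torch
import torch.nn as nn

class XLPEmbedding(nn.Module):
    """Sketch of Cross-lingual Language Projection: word embeddings are
    projected into a language-specific space instead of having a language
    embedding added to them."""
    def __init__(self, vocab_size=1000, dim=64, num_languages=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        # One projection matrix per language (illustrative parameterization).
        self.proj = nn.Parameter(torch.stack(
            [torch.eye(dim) for _ in range(num_languages)]))

    def forward(self, token_ids: torch.Tensor, lang_id: int) -> torch.Tensor:
        return self.word_emb(token_ids) @ self.proj[lang_id]

emb = XLPEmbedding()
out = emb(torch.tensor([[1, 2, 3]]), lang_id=2)  # feed this to the Transformer
print(out.shape)  # torch.Size([1, 3, 64])
```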
- Cross-neutralising: Probing for joint encoding of linguistic information in multilingual models [17.404220737977738]
We study how relationships between languages are encoded in two state-of-the-art multilingual models.
The results suggest that linguistic properties are encoded jointly across typologically similar languages.
arXiv Detail & Related papers (2020-10-24T07:55:32Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
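The predictor architecture can be sketched as byte embeddings, a convolution, a recurrent layer, and a classifier over one WALS feature's values; all hyperparameters below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class WALSPredictor(nn.Module):
    """Sketch: byte embeddings -> Conv1d -> GRU -> classify one WALS feature."""
    def __init__(self, num_classes=7, dim=32):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)        # one embedding per byte value
        self.conv = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)       # e.g. WALS 81A word-order values

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        x = self.byte_emb(byte_ids).transpose(1, 2)   # (batch, dim, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, seq, dim)
        _, h = self.rnn(x)                            # final hidden state
        return self.head(h.squeeze(0))

text = "Der Hund jagt die Katze."
byte_ids = torch.tensor([list(text.encode("utf-8"))])
print(WALSPredictor()(byte_ids).shape)  # logits over the feature's values
```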
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
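The analysis can be approximated with plain CCA from scikit-learn (the paper uses singular vector CCA, which first reduces each view with an SVD); the two "views" below are random placeholders for, e.g., typology-derived language vectors versus vectors learned by an NMT system.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Two views of the same set of languages (placeholder data). SVCCA would
# additionally SVD-reduce each view before the correlation analysis.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(50, 20))   # 50 languages x 20 typology dims
view_b = view_a @ rng.normal(size=(20, 30)) + 0.1 * rng.normal(size=(50, 30))

cca = CCA(n_components=5).fit(view_a, view_b)
a_c, b_c = cca.transform(view_a, view_b)
corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1] for i in range(5)]
print(corrs)  # high canonical correlations = shared typological signal
```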
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.