The interplay between morphological typology and script on a novel
multi-layer Algerian dialect corpus
- URL: http://arxiv.org/abs/2105.07400v1
- Date: Sun, 16 May 2021 10:22:21 GMT
- Title: The interplay between morphological typology and script on a novel
multi-layer Algerian dialect corpus
- Authors: Samia Touileb and Jeremy Barnes
- Abstract summary: We introduce a newly annotated corpus of Algerian user-generated comments comprising parallel annotations of Algerian written in Latin, Arabic, and code-switched scripts.
We find there is a delicate relationship between script and typology for part-of-speech, while sentiment analysis is less sensitive.
- Score: 4.974890682815778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen a rise in interest for cross-lingual transfer between
languages with similar typology, and between languages of various scripts.
However, the interplay between morphological typology and difference in script
on cross-lingual transfer is a less studied problem. We explore this interplay
on cross-lingual transfer for two supervised tasks, namely part-of-speech
tagging and sentiment analysis. We introduce a newly annotated corpus of
Algerian user-generated comments comprising parallel annotations of Algerian
written in Latin, Arabic, and code-switched scripts, as well as annotations for
sentiment and topic categories. We perform baseline experiments by fine-tuning
multi-lingual language models. We further explore the effect of script vs.
morphological typology in cross-lingual transfer by fine-tuning multi-lingual
models on languages which are a) morphologically distinct, but use the same
script, b) morphologically similar, but use a distinct script, or c) are
morphologically similar and use the same script. We find there is a delicate
relationship between script and typology for part-of-speech, while sentiment
analysis is less sensitive.
Related papers
- Unknown Script: Impact of Script on Cross-Lingual Transfer [2.5398014196797605]
Cross-lingual transfer has become an effective way of transferring knowledge between languages.
We consider a case where the target language and its script are not part of the pre-trained model.
Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size.
arXiv Detail & Related papers (2024-04-29T15:48:01Z) - Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology [0.0]
This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean.
It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the many world's languages that exclusively use lexified connectors.
The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.
arXiv Detail & Related papers (2024-04-28T17:43:24Z) - Investigating Lexical Sharing in Multilingual Machine Translation for
Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z) - Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
In Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative.
In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level.
For computing literature, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study.
Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at word (nouns and verbs for English-Turkish,
arXiv Detail & Related papers (2022-05-06T17:04:58Z) - Modeling Target-Side Morphology in Neural Machine Translation: A
Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - What does it mean to be language-agnostic? Probing multilingual sentence
encoders for typological properties [17.404220737977738]
We propose methods for probing sentence representations from state-of-the-art multilingual encoders.
Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies.
arXiv Detail & Related papers (2020-09-27T15:00:52Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.