Related papers: Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology

URL: http://arxiv.org/abs/2404.18257v1
Date: Sun, 28 Apr 2024 17:43:24 GMT
Title: Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology
Authors: Nilo Pedrazzini
Abstract summary: This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the world's many languages that exclusively use lexified connectors. The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is instead much harder to tackle and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the world's many languages that exclusively use lexified connectors, incorporating associations between character $n$-grams and English 'when'. The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.

Related papers

Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation [9.23725598061561]
This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages. We show OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods.
arXiv Detail & Related papers (2026-02-04T05:59:25Z)
On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility [5.521655731616328]
Positional encodings are a direct target to investigate the implications of the trade-off hypothesis. Contrary to previous findings, we do not observe a clear interaction between position encodings and morphological complexity or word order flexibility. Our results show that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.
arXiv Detail & Related papers (2025-11-11T11:50:21Z)
How Important Is Tokenization in French Medical Masked Language Models? [7.866517623371908]
Subword tokenization has become the prevailing standard in the field of natural language processing (NLP). This paper seeks to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks. We introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
arXiv Detail & Related papers (2024-02-22T23:11:08Z)
Complex systems approach to natural language [0.0]
This review summarizes the main methodological concepts used in studying natural language from the perspective of complexity science. Three main complexity-related research trends in quantitative linguistics are covered.
arXiv Detail & Related papers (2024-01-05T12:01:26Z)
Exploring Linguistic Probes for Morphological Generalization [11.568042812213712]
Testing these probes on three morphologically distinct languages, we find evidence that three leading morphological inflection systems employ distinct generalization strategies over conjugational classes and feature sets on both orthographic and phonologically transcribed inputs.
arXiv Detail & Related papers (2023-10-20T17:45:30Z)
Analogy in Contact: Modeling Maltese Plural Inflection [4.83828446399992]
We quantify the extent to which the phonology and etymology of a Maltese singular noun may predict the morphological process. The results indicate phonological pressures shape the organization of the Maltese lexicon with predictive power.
arXiv Detail & Related papers (2023-05-20T20:16:57Z)
Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Work in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative. In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level. For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study. Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at word (nouns and verbs for English-Turkish,
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns. This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers. We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source. We observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures. We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
