Detecting Lexical Borrowings from Dominant Languages in Multilingual Wordlists
- URL: http://arxiv.org/abs/2302.00189v1
- Date: Wed, 1 Feb 2023 02:44:28 GMT
- Title: Detecting Lexical Borrowings from Dominant Languages in Multilingual Wordlists
- Authors: John E. Miller and Johann-Mattis List
- Abstract summary: We test new methods for lexical borrowing detection in contact situations where dominant languages play an important role.
All methods perform well, with the supervised machine learning system outperforming the classical systems.
A review of detection errors shows that borrowing detection could be substantially improved by taking into account donor words with divergent meanings from recipient words.
- Score: 3.096615629099617
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language contact is a pervasive phenomenon reflected in the borrowing of
words from donor to recipient languages. Most computational approaches to
borrowing detection treat all languages under study as equally important, even
though dominant languages have a stronger impact on heritage languages than
vice versa. We test new methods for lexical borrowing detection in contact
situations where dominant languages play an important role, applying two
classical sequence comparison methods and one machine learning method to a
sample of seven Latin American languages which have all borrowed extensively
from Spanish. All methods perform well, with the supervised machine learning
system outperforming the classical systems. A review of detection errors shows
that borrowing detection could be substantially improved by taking into account
donor words with divergent meanings from recipient words.
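As a rough illustration of the task, here is a minimal sketch of threshold-based borrowing detection against a donor wordlist. The toy wordlist, the 0.5 threshold, and the Quechua-style form "kawallu" (from Spanish "caballo") are illustrative assumptions; the paper's classical sequence-comparison and supervised systems are considerably more sophisticated.

```python
# Minimal sketch: flag a recipient word as a likely borrowing when some
# donor (Spanish) word is close under length-normalized edit distance.
# Threshold and wordlist are illustrative assumptions, not the paper's setup.

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_borrowing(word: str, donor_lexicon: list[str], threshold: float = 0.5) -> bool:
    return any(
        levenshtein(word, donor) / max(len(word), len(donor)) <= threshold
        for donor in donor_lexicon
    )

spanish = ["caballo", "mesa", "vaca"]
print(is_borrowing("kawallu", spanish))  # True: 'kawallu' is close to 'caballo'
```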
Related papers
- CEAID: Benchmark of Multilingual Machine-Generated Text Detection Methods for Central European Languages [4.089936423985361]
We provide the first benchmark of detection methods focused on Central European languages. We compare training-language combinations to identify the best-performing ones. Supervised, fine-tuned detectors in the Central European languages are found to be the most performant in these languages.
arXiv Detail & Related papers (2025-09-30T10:27:53Z) - Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training [58.696660064190475]
We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities.
To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching.
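A minimal sketch of what dictionary-based synthetic code-switching can look like; the toy bilingual lexicon and substitution rate below are illustrative assumptions, not the paper's exact recipe:

```python
import random

# Toy bilingual lexicon; a real setup would use a large induced dictionary.
bilingual_lexicon = {"dog": "perro", "water": "agua", "house": "casa"}

def code_switch(tokens: list[str], rate: float = 0.3, seed: int = 0) -> list[str]:
    """Replace a random subset of translatable tokens with their translations."""
    rng = random.Random(seed)
    return [bilingual_lexicon[t] if t in bilingual_lexicon and rng.random() < rate
            else t for t in tokens]

print(code_switch("the dog drank the water".split()))
```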
arXiv Detail & Related papers (2025-04-02T15:09:58Z) - Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages [15.203789021094982]
In large language models (LLMs), how are multiple languages learned and encoded?
We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages.
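A minimal sketch of a sparse autoencoder of this kind: reconstruct model activations through an overcomplete, non-negative code with an L1 sparsity penalty. The dimensions and penalty weight are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative codes
        return self.decoder(features), features

sae = SparseAutoencoder()
acts = torch.randn(8, 4096)                      # stand-in for LLM activations
recon, feats = sae(acts)
loss = nn.functional.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()
loss.backward()                                  # one reconstruction + sparsity step
```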
arXiv Detail & Related papers (2025-01-10T21:18:21Z) - How Do Multilingual Models Remember? Investigating Multilingual Factual Recall Mechanisms [50.13632788453612]
Large Language Models (LLMs) store and retrieve vast amounts of factual knowledge acquired during pre-training.
The question of how these processes generalize to other languages and multilingual LLMs remains unexplored.
We examine when language plays a role in the recall process, uncovering evidence of language-independent and language-dependent mechanisms.
arXiv Detail & Related papers (2024-10-18T11:39:34Z) - The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet it remains inconclusive whether language imbalance causes cross-lingual generalisation in that setting.
arXiv Detail & Related papers (2024-04-11T17:58:05Z) - A Computational Model for the Assessment of Mutual Intelligibility Among
Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
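The Linear Discriminative Learner models comprehension as a linear map from word-form vectors to meaning vectors, estimated by least squares. A minimal sketch, with random matrices standing in for real form and semantic encodings:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(50, 20))    # form matrix: 50 words x 20 form features
S = rng.normal(size=(50, 30))    # semantic matrix: 50 words x 30 dimensions

# Comprehension mapping F solves C @ F ~= S in the least-squares sense.
F, *_ = np.linalg.lstsq(C, S, rcond=None)

# How often does a word's predicted meaning land closest to its own
# semantic vector? (A toy proxy for comprehension accuracy.)
predicted = C @ F
hits = np.argmax(predicted @ S.T, axis=1) == np.arange(50)
print(f"correctly mapped: {hits.mean():.0%}")
```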
arXiv Detail & Related papers (2024-02-05T11:32:13Z) - Weakly-supervised Deep Cognate Detection Framework for Low-Resourced
Languages Using Morphological Knowledge of Closely-Related Languages [1.7622337807395716]
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks.
Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models.
This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages.
arXiv Detail & Related papers (2023-11-09T05:46:41Z) - Transfer Language Selection for Zero-Shot Cross-Lingual Abusive Language
Detection [2.2998722397348335]
Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection.
Our datasets cover seven languages from three language families.
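Schematically, zero-shot transfer means fitting a classifier on source-language embeddings and applying it unchanged to the target language. In this sketch, `embed` is a hypothetical stand-in for a multilingual sentence encoder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical placeholder for a multilingual encoder
    (e.g., mean-pooled states of a multilingual transformer)."""
    rng = np.random.default_rng(sum(len(t) for t in texts))
    return rng.normal(size=(len(texts), 768))

# Fit on labelled source-language data only...
X_train = embed(["you are awful", "have a nice day"])
clf = LogisticRegression().fit(X_train, [1, 0])

# ...then predict directly on an unlabelled target language (zero-shot).
print(clf.predict(embed(["qué día tan bonito"])))
```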
arXiv Detail & Related papers (2022-06-02T09:53:15Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
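Schematically, unsupervised ad-hoc CLIR reduces to cosine-ranking documents against the query in a shared multilingual embedding space; `embed` is again a hypothetical encoder stand-in:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical stand-in for a shared multilingual embedding space."""
    rng = np.random.default_rng(sum(map(len, texts)))
    return rng.normal(size=(len(texts), 768))

def rank(query: str, docs: list[str]) -> list[int]:
    q = embed([query])[0]
    D = embed(docs)
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    return list(np.argsort(-sims))               # best-matching doc index first

print(rank("préstamos léxicos", ["lexical borrowing", "weather report"]))
```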
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - It's All in the Heads: Using Attention Heads as a Baseline for
Cross-Lingual Transfer in Commonsense Reasoning [4.200736775540874]
We design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features.
The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning.
Most of the performance is attributable to the same small subset of attention heads across all studied languages.
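A minimal sketch of the heads-as-features idea: one scalar feature per (layer, head) with a linear classifier on top. Here `head_features` is a hypothetical extractor standing in for real attention statistics from a pretrained multilingual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def head_features(sentence: str, n_layers: int = 12, n_heads: int = 12) -> np.ndarray:
    """Hypothetical placeholder: one scalar per (layer, head), e.g. attention
    mass on a candidate answer span, flattened to a 144-dim vector."""
    rng = np.random.default_rng(sum(map(ord, sentence)))
    return rng.normal(size=n_layers * n_heads)

X = np.stack([head_features(s) for s in ["plausible ending", "implausible ending"]])
clf = LogisticRegression().fit(X, [1, 0])

# Inspect which heads carry the signal via the classifier's weights.
print(np.argsort(-np.abs(clf.coef_[0]))[:5])     # top-5 most informative heads
```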
arXiv Detail & Related papers (2021-06-22T21:25:43Z) - On Negative Interference in Multilingual Models: Findings and A
Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages.
We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
arXiv Detail & Related papers (2020-10-06T20:48:58Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
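Schematically, singular vector CCA first denoises each view with a truncated SVD and then correlates the reduced views with CCA. A minimal sketch, with random matrices standing in for the typological and NMT-learned language representations:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))   # view 1: e.g. typology-derived vectors
Y = rng.normal(size=(100, 60))   # view 2: e.g. NMT-learned language vectors

def svd_reduce(M: np.ndarray, k: int = 10) -> np.ndarray:
    U, s, _ = np.linalg.svd(M - M.mean(0), full_matrices=False)
    return U[:, :k] * s[:k]      # keep the top-k singular directions

cca = CCA(n_components=5).fit(svd_reduce(X), svd_reduce(Y))
Xc, Yc = cca.transform(svd_reduce(X), svd_reduce(Y))
corrs = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(5)]
print(corrs)                     # canonical correlations between the views
```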
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
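One simple operation in this spirit is per-language mean centering, which removes a language-identity component from the embeddings. A minimal sketch, with synthetic vectors whose offsets mimic language-specific shifts:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {"en": rng.normal(0.5, 1.0, size=(200, 64)),    # offsets mimic a
       "de": rng.normal(-0.5, 1.0, size=(200, 64))}   # language-identity shift

centered = {lang: X - X.mean(axis=0, keepdims=True) for lang, X in emb.items()}

def mean_cos(A: np.ndarray, B: np.ndarray) -> float:
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float(np.mean(A @ B.T))

print(mean_cos(emb["en"], emb["de"]))            # dominated by language identity
print(mean_cos(centered["en"], centered["de"]))  # near zero once centered
```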
arXiv Detail & Related papers (2020-04-09T19:50:32Z) - Markov Chain Monte-Carlo Phylogenetic Inference Construction in
Computational Historical Linguistics [0.0]
As more and more of the world's languages come under study, traditional methods of historical linguistics face new challenges. In this paper, I use computational methods to cluster the languages and the Markov Chain Monte Carlo (MCMC) method to build a language typology relationship tree.
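At its core, MCMC inference of this kind rests on a Metropolis-Hastings loop: propose a change to the current state and accept it with a probability tied to the likelihood ratio. A minimal skeleton over a toy one-dimensional "branch length" standing in for a full tree state:

```python
import math
import random

def log_likelihood(t: float) -> float:
    """Toy target over a branch length t (Gamma-like, peaked at t = 3)."""
    return 3 * math.log(t) - t if t > 0 else -math.inf

random.seed(0)
t, samples = 1.0, []
for _ in range(10_000):
    proposal = t + random.gauss(0, 0.5)          # symmetric random-walk move
    if random.random() < math.exp(min(0.0, log_likelihood(proposal) - log_likelihood(t))):
        t = proposal                             # Metropolis accept
    samples.append(t)

burned = samples[2_000:]                         # discard burn-in
print(sum(burned) / len(burned))                 # posterior mean of t (~4)
```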
arXiv Detail & Related papers (2020-02-22T06:01:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.