A Computational Approach to Measuring the Semantic Divergence of
Cognates
- URL: http://arxiv.org/abs/2012.01288v1
- Date: Wed, 2 Dec 2020 15:52:38 GMT
- Title: A Computational Approach to Measuring the Semantic Divergence of
Cognates
- Authors: Ana-Sabina Uban, Alina-Maria Ciobanu, Liviu P. Dinu
- Abstract summary: We investigate semantic divergence across languages by measuring the semantic similarity of cognate sets in multiple languages.
A language-agnostic method facilitates a quantitative analysis of cognates divergence.
We introduce the notion of "soft false friend" and "hard false friend", as well as a measure of the degree of "falseness" of a false friends pair.
- Score: 2.66418345185993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Meaning is the foundation stone of intercultural communication. Languages are
continuously changing, and words shift their meanings for various reasons.
Semantic divergence in related languages is a key concern of historical
linguistics. In this paper we investigate semantic divergence across languages
by measuring the semantic similarity of cognate sets in multiple languages. The
method that we propose is based on cross-lingual word embeddings. In this paper
we implement and evaluate our method on English and five Romance languages, but
it can be extended easily to any language pair, requiring only large
monolingual corpora for the involved languages and a small bilingual dictionary
for the pair. This language-agnostic method facilitates a quantitative analysis
of cognates divergence -- by computing degrees of semantic similarity between
cognate pairs -- and provides insights for identifying false friends. As a
second contribution, we formulate a straightforward method for detecting false
friends, and introduce the notion of "soft false friend" and "hard false
friend", as well as a measure of the degree of "falseness" of a false friends
pair. Additionally, we propose an algorithm that can output suggestions for
correcting false friends, which could result in a very helpful tool for
language learning or translation.
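A minimal sketch of the idea in the abstract, assuming the word embeddings have already been mapped into a shared cross-lingual space. The toy vectors, the function names, and the `1 - cosine` divergence score are illustrative assumptions, not the authors' definitions or data. It only shows how a cognate pair can be scored by similarity and how a replacement for a suspected false friend might be suggested.

```python
import math

# Toy vectors standing in for embeddings already aligned into one shared space.
# All values are invented for illustration; they are not from the paper.
SHARED_SPACE = {
    ("en", "library"): [0.90, 0.10, 0.00],
    ("fr", "librairie"): [0.10, 0.90, 0.10],       # means "bookshop"
    ("fr", "bibliothèque"): [0.85, 0.15, 0.05],    # means "library"
    ("en", "animal"): [0.00, 0.20, 0.95],
    ("fr", "animal"): [0.05, 0.15, 0.97],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cognate_similarity(src, tgt):
    """Semantic similarity of a cognate pair given as (language, word) keys."""
    return cosine(SHARED_SPACE[src], SHARED_SPACE[tgt])


def divergence(src, tgt):
    """Hypothetical divergence score: 1 - cosine similarity (an assumption)."""
    return 1.0 - cognate_similarity(src, tgt)


def suggest_correction(src, tgt_lang):
    """Return the target-language word closest to the source word in the shared space."""
    candidates = [key for key in SHARED_SPACE if key[0] == tgt_lang]
    best = max(candidates, key=lambda key: cosine(SHARED_SPACE[src], SHARED_SPACE[key]))
    return best[1]


if __name__ == "__main__":
    for pair in [(("en", "animal"), ("fr", "animal")),
                 (("en", "library"), ("fr", "librairie"))]:
        print(pair, round(divergence(*pair), 3))
    print(suggest_correction(("en", "library"), "fr"))
```

With real data, the shared space would come from cross-lingual embeddings trained on large monolingual corpora and aligned with a small bilingual dictionary, as the abstract describes. Any threshold separating "soft" from "hard" false friends would have to be taken from the paper itself.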
Related papers
- Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer [13.630754537249707]
Tokenization defines the foundation of multilingual language models. New framework trains tokenizers monolingually and aligns vocabularies exhaustively using bilingual dictionaries or word-to-word translation.
arXiv Detail & Related papers (2025-10-07T17:05:49Z) - False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models [53.01170039144264]
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? We find that models with overlap outperform models with disjoint vocabularies.
arXiv Detail & Related papers (2025-09-23T07:47:54Z) - Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense [30.62699081329474]
We introduce a novel benchmark for cross-lingual sense disambiguation, StingrayBench.
We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German.
In our analysis of various models, we observe they tend to be biased toward higher-resource languages.
arXiv Detail & Related papers (2024-10-28T22:09:43Z) - A Computational Model for the Assessment of Mutual Intelligibility Among
Closely Related Languages [1.5773159234875098]
Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it.
Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments.
We propose a computer-assisted method using the Linear Discriminative Learner to approximate the cognitive processes by which humans learn languages.
arXiv Detail & Related papers (2024-02-05T11:32:13Z) - Are Mutually Intelligible Languages Easier to Translate? [30.41671642147019]
We show that the amount of data needed to train a neural machine translation model is anti-proportional to the languages' mutual intelligibility.
Experiments on the Romance language group reveal that there is indeed strong correlation between the area under a model's learning curve and mutual intelligibility scores obtained by studying human speakers.
arXiv Detail & Related papers (2022-01-31T09:22:23Z) - Fake it Till You Make it: Self-Supervised Semantic Shifts for
Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z) - Linguistic Classification using Instance-Based Learning [0.0]
We take a contrarian approach and question the tree-based model that is rather restrictive.
For example, the affinity that Sanskrit independently has with languages across Indo-European languages is better illustrated using a network model.
We can say the same about inter-relationship between languages in India, where the inter-relationships are better discovered than assumed.
arXiv Detail & Related papers (2020-12-02T04:12:10Z) - Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z) - Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text
Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.