Statistical analysis of word flow among five Indo-European languages
- URL: http://arxiv.org/abs/2301.06985v1
- Date: Tue, 17 Jan 2023 16:12:42 GMT
- Title: Statistical analysis of word flow among five Indo-European languages
- Authors: Josué Ely Molina, Jorge Flores, Carlos Gershenson and Carlos Pineda
- Abstract summary: We use the Google Books Ngram dataset to analyze word flow among English, French, German, Italian, and Spanish.
We study what we define as ``migrant words'', a type of loanword that does not change its spelling.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent increase in data availability has made it possible to perform
a variety of statistical linguistic studies. Here we use the Google Books Ngram
dataset to analyze word flow among English, French, German, Italian, and
Spanish. We study what we define as ``migrant words'', a type of loanword that
does not change its spelling. We quantify migrant words from one language to
another across decades, and observe that most migrant words can be grouped
into semantic fields and associated with historical events. We also study
the statistical properties of accumulated migrant words and their rank
dynamics. We propose a measure of the use of migrant words that could serve as
a proxy for cultural influence. Our methodology is not exempt from caveats, but
our results are encouraging and motivate further studies in this direction.
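The quantification described above (identically spelled words crossing from one language's corpus into another's, tracked per decade) can be sketched as follows. This is an illustrative sketch, not the authors' code: the data layout, the function name `migrant_words`, and the frequency threshold are all hypothetical assumptions.

```python
# Illustrative sketch (not the authors' code): flagging "migrant words"
# from per-language, per-decade relative-frequency tables. Each table is
# assumed to map word -> {decade: relative frequency}; the threshold for
# a word counting as "present" in a corpus is an arbitrary choice here.

def migrant_words(source_freqs, target_freqs, threshold=1e-8):
    """Return words spelled identically in both corpora, mapped to the
    first decade in which each crosses `threshold` in the target language."""
    migrants = {}
    for word, source_series in source_freqs.items():
        target_series = target_freqs.get(word)
        if target_series is None:
            continue  # the word never appears in the target corpus
        # first decade the word is established in the source language
        src_first = min((d for d, f in source_series.items() if f >= threshold),
                        default=None)
        # first decade it crosses the threshold in the target language
        tgt_first = min((d for d, f in target_series.items() if f >= threshold),
                        default=None)
        if src_first is not None and tgt_first is not None and tgt_first >= src_first:
            migrants[word] = tgt_first
    return migrants

# Toy example (invented numbers): "software" migrating from English into Spanish.
en = {"software": {1960: 2e-7, 1970: 5e-7}, "the": {1960: 5e-2}}
es = {"software": {1960: 0.0, 1970: 3e-8}, "casa": {1960: 1e-4}}
print(migrant_words(en, es))  # {'software': 1970}
```

A real analysis would also need to exclude cognates and shared vocabulary that predate the corpus, which is one of the caveats the abstract alludes to.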
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z)
- Crowdsourcing Lexical Diversity [7.569845058082537]
This paper proposes a novel crowdsourcing methodology for reducing bias in lexicons.
Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food.
We validated our method by applying it to two case studies focused on food-related terminology.
arXiv Detail & Related papers (2024-10-30T15:45:09Z)
- Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense [30.62699081329474]
We introduce a novel benchmark for cross-lingual sense disambiguation, StingrayBench.
We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German.
In our analysis of various models, we observe they tend to be biased toward higher-resource languages.
arXiv Detail & Related papers (2024-10-28T22:09:43Z)
- MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms [10.154915854525928]
We train a multilingual transformer model (XLM-RoBERTa) to disambiguate potentially euphemistic terms (PETs) in multilingual and cross-lingual settings.
We show that multilingual models perform better on the task compared to monolingual models by a statistically significant margin.
In a follow-up analysis, we focus on universal euphemistic "categories" such as death and bodily functions among others.
arXiv Detail & Related papers (2024-01-25T21:38:30Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Lexical Diversity in Kinship Across Languages and Dialects [6.80465507148218]
We introduce a method to enrich computational lexicons with content relating to linguistic diversity.
The method is verified through two large-scale case studies on kinship terminology.
arXiv Detail & Related papers (2023-08-24T19:49:30Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Language statistics at different spatial, temporal, and grammatical scales [48.7576911714538]
We use data from Twitter to explore the rank diversity at different scales.
The greatest changes come from variations in the grammatical scale.
As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales.
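The rank-diversity measure used both in that entry and in the main paper's "rank dynamics" can be sketched as follows. This is a hedged illustration: the function name `rank_diversity` and the toy word lists are invented, and the definition assumed here (for each rank, the number of distinct words occupying it across time slices, normalized by the number of slices) is one common formulation, not necessarily the exact one used in these papers.

```python
# Hedged sketch of rank diversity: for each rank k, count the distinct
# words that occupy rank k across time slices, normalized by the number
# of slices. A stable rank (always the same word) has low diversity.

def rank_diversity(ranked_lists):
    """ranked_lists: one word list per time slice, each ordered by
    decreasing frequency. Returns {rank: diversity in (0, 1]}."""
    n_slices = len(ranked_lists)
    max_rank = min(len(lst) for lst in ranked_lists)
    return {
        k: len({lst[k] for lst in ranked_lists}) / n_slices
        for k in range(max_rank)
    }

# Toy example with three time slices (invented data):
slices = [
    ["the", "of", "data"],
    ["the", "of", "word"],
    ["the", "to", "word"],
]
print(rank_diversity(slices))
# rank 0 is always "the" (diversity 1/3); ranks 1 and 2 each see
# two distinct words across the three slices (diversity 2/3).
```

The typical empirical finding is exactly this shape: low diversity at the top ranks and higher diversity deeper in the distribution.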
arXiv Detail & Related papers (2022-07-02T01:38:48Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Using Known Words to Learn More Words: A Distributional Analysis of Child Vocabulary Development [0.0]
We investigated item-based variability in vocabulary development using lexical properties of distributional statistics.
We predicted word trajectories cross-sectionally, shedding light on trends in vocabulary development that may not have been evident at a single time point.
We also show that whether one looks at a single age group or across ages as a whole, the best distributional predictor of whether a child knows a word is the number of other known words with which that word tends to co-occur.
arXiv Detail & Related papers (2020-09-15T01:18:21Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.