A large scale lexical and semantic analysis of Spanish language
variations in Twitter
- URL: http://arxiv.org/abs/2110.06128v1
- Date: Tue, 12 Oct 2021 16:21:03 GMT
- Authors: Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario
Graff
- Abstract summary: This manuscript presents a broad analysis describing lexical and semantic relationships among 26 Spanish-speaking countries around the globe.
We analyze four years of the Twitter geotagged public stream to provide an extensive survey of the Spanish language vocabularies of different countries.
- Score: 2.3511629321667096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dialectometry is a discipline devoted to studying the variations of a
language around a geographical region. One of its goals is the creation of
linguistic atlases capturing the similarities and differences of the language
under study around the area in question. For instance, Spanish is one of the
most spoken languages in the world, but it is not necessarily written
and spoken in the same way in different countries. This manuscript presents a
broad analysis describing lexical and semantic relationships among 26
Spanish-speaking countries around the globe. For this study, we analyze
four years of the Twitter geotagged public stream to provide an extensive survey
of the Spanish language vocabularies of different countries, their distributions,
semantic usage of terms, and emojis. We also offer open regional word-embedding
resources for Spanish Twitter to help other researchers and practitioners take
advantage of regionalized models.
Related papers
- Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs [0.4771833920251869]
This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish.
Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions.
Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels.
arXiv Detail & Related papers (2026-02-10T02:42:22Z)
- Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní [1.0248720782518987]
This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní.
Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences.
The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts.
arXiv Detail & Related papers (2025-12-03T00:56:27Z)
- Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense [30.62699081329474]
We introduce a novel benchmark for cross-lingual sense disambiguation, StingrayBench.
We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German.
In our analysis of various models, we observe they tend to be biased toward higher-resource languages.
arXiv Detail & Related papers (2024-10-28T22:09:43Z)
- Historical Ink: Semantic Shift Detection for 19th Century Spanish [0.0]
This paper explores the evolution of word meanings in 19th-century Spanish texts, with an emphasis on Latin American Spanish.
It addresses the Semantic Shift Detection (SSD) task, which is crucial for understanding linguistic evolution.
arXiv Detail & Related papers (2024-07-08T16:49:34Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Lexical Diversity in Kinship Across Languages and Dialects [6.80465507148218]
We introduce a method to enrich computational lexicons with content relating to linguistic diversity.
The method is verified through two large-scale case studies on kinship terminology.
arXiv Detail & Related papers (2023-08-24T19:49:30Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Comparing Biases and the Impact of Multilingual Training across Multiple
Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)
- The Geometry of Multilingual Language Models: An Equality Lens [2.6746119935689214]
We analyze the geometry of three multilingual language models in Euclidean space.
Using a geometric separability index we find that although languages tend to be closer according to their linguistic family, they are almost separable with languages from other families.
arXiv Detail & Related papers (2023-05-13T05:19:15Z)
- Comparing Spoken Languages using Paninian System of Sounds and Finite State Machines [0.0]
We propose an Ecosystem Model for Linguistic Development with Sanskrit at the core.
We represent words across languages as state transitions on the phonetic map and construct corresponding Morphological Finite Automata.
arXiv Detail & Related papers (2023-01-29T15:22:10Z)
- Spanish Legalese Language Model and Corpora [0.0629976670819788]
Legal slang could be thought of as a Spanish variant of its own, as it is very complicated in vocabulary, semantics and phrase understanding.
For this work we gathered legal-domain corpora from different sources, generated a model and evaluated against Spanish general domain tasks.
arXiv Detail & Related papers (2021-10-23T12:06:51Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)