Language statistics at different spatial, temporal, and grammatical
scales
- URL: http://arxiv.org/abs/2207.00709v1
- Date: Sat, 2 Jul 2022 01:38:48 GMT
- Title: Language statistics at different spatial, temporal, and grammatical
scales
- Authors: Fernanda Sánchez-Puig, Rogelio Lozano-Aranda, Dante Pérez-Méndez, Ewan Colman, Alfredo J. Morales-Guzmán, Carlos Pineda, and Carlos Gershenson
- Abstract summary: We use data from Twitter to explore the rank diversity at different scales.
The greatest changes come from variations in the grammatical scale.
As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales.
- Score: 48.7576911714538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statistical linguistics has advanced considerably in recent decades as data
has become available. This has allowed researchers to study how statistical
properties of languages change over time. In this work, we use data from
Twitter to explore English and Spanish considering the rank diversity at
different scales: temporal (from 3 to 96 hour intervals), spatial (from 3km to
3000+km radii), and grammatical (from monograms to pentagrams). We find that
all three scales are relevant. However, the greatest changes come from
variations in the grammatical scale. At the lowest grammatical scale
(monograms), the rank diversity curves are most similar, independently of the
values of other scales, languages, and countries. As the grammatical scale
grows, the rank diversity curves vary more depending on the temporal and
spatial scales, as well as on the language and country. We also study the
statistics of Twitter-specific tokens: emojis, hashtags, and user mentions.
These particular types of tokens show sigmoid-like behaviour in their rank
diversity curves. Our results help quantify which aspects of language
statistics appear universal and which may lead to variations.
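The rank diversity measure at the centre of the abstract has a simple operational definition: slice the corpus (by time interval, region, or n-gram order), rank tokens by frequency within each slice, and for each rank count how many distinct tokens ever occupy it. A minimal sketch of this computation in Python; the function name and the toy slices are ours, not taken from the paper:

```python
from collections import Counter

def rank_diversity(corpora):
    """Rank diversity d(r): for each rank r, the number of distinct
    tokens occupying that rank across the corpus slices, normalised
    by the number of slices.

    `corpora` is a list of token lists, one per slice (e.g. one per
    time interval or spatial region).
    """
    T = len(corpora)
    tokens_at_rank = {}
    for tokens in corpora:
        # Rank tokens by frequency within this slice, most frequent first.
        ranking = [w for w, _ in Counter(tokens).most_common()]
        for r, word in enumerate(ranking, start=1):
            tokens_at_rank.setdefault(r, set()).add(word)
    # d(r) = (# distinct tokens seen at rank r) / (# slices)
    return {r: len(words) / T for r, words in tokens_at_rank.items()}

# Toy example: two "time slices" sharing the top word but not rank 2.
slices = [
    "the cat sat on the mat the cat".split(),
    "the dog ran to the dog park the".split(),
]
d = rank_diversity(slices)
# Rank 1 is "the" in both slices, so d(1) = 0.5; rank 2 differs
# ("cat" vs "dog"), so d(2) = 1.0.
```

A flat low curve means the same tokens monopolise the top ranks across slices; the paper's observation is that this stability weakens as the grammatical scale grows from monograms to pentagrams.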
Related papers
- Evolving linguistic divergence on polarizing social media [0.0]
We quantify divergence in topics of conversation and word frequencies, messaging sentiment, and lexical semantics of words and emoji.
While US American English remains largely intelligible within its large speech community, our findings point to areas where miscommunication may arise.
arXiv Detail & Related papers (2023-09-04T15:21:55Z)
- Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity [64.18762301574954]
Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings.
This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context.
We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models.
arXiv Detail & Related papers (2023-06-01T09:01:48Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
Models perform significantly worse in all of these languages than in English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Comparing Biases and the Impact of Multilingual Training across Multiple Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)
- Geolocation differences of language use in urban areas [0.0]
We explore the use of Twitter data with precise geolocation information to resolve spatial variations in language use on an urban scale down to single city blocks.
Our work shows that analysis of small-scale variations can provide unique information on correlations between language use and social context.
arXiv Detail & Related papers (2021-08-01T19:55:45Z)
- A Statistical Model of Word Rank Evolution [1.1011268090482575]
This work explores the word rank dynamics of eight languages by investigating the Google Books corpus unigram frequency data set.
We observed the rank changes of the unigrams from 1900 to 2008 and compared them to a Wright-Fisher-inspired model that we developed for our analysis.
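For readers unfamiliar with the model class this summary refers to, a Wright-Fisher generation is simply resampling: every individual in the next generation copies its type from a uniformly chosen parent. This sketch is the generic neutral-drift model, not the paper's adapted version, and all names in it are illustrative:

```python
import random

def wright_fisher_step(population, rng):
    """One Wright-Fisher generation under neutral drift: each member
    of the next generation copies the type of a uniformly random
    parent from the current generation."""
    return [rng.choice(population) for _ in population]

rng = random.Random(0)  # fixed seed so the run is reproducible
pop = ["a"] * 50 + ["b"] * 50
for _ in range(200):
    pop = wright_fisher_step(pop, rng)
# Population size stays constant; under pure drift one variant
# tends to fix (reach frequency 1) given enough generations.
```

In a rank-evolution setting, the "types" would be words and the drift of their frequencies induces the observed churn in rank orderings.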
arXiv Detail & Related papers (2021-07-21T08:57:32Z)
- Capturing the diversity of multilingual societies [0.0]
We consider the processes at work in language shift through a conjunction of theoretical and data-driven perspectives.
A large-scale empirical study of spatial patterns of languages in multilingual societies using Twitter and census data yields a wide diversity.
We propose a model in which coexistence of languages may be reached when learning the other language is facilitated and when bilinguals favor the use of the endangered language.
arXiv Detail & Related papers (2021-05-06T10:27:43Z)
- Measuring Linguistic Diversity During COVID-19 [1.0312968200748118]
This paper calibrates measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic.
Previous work has mapped the distribution of languages using geo-referenced social media and web data.
This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias in digital corpora introduced by non-local populations.
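The Herfindahl-Hirschman Index underlying that method is just the sum of squared shares, here the share of each language in a region's corpus. A minimal sketch with invented counts:

```python
def herfindahl_hirschman(counts):
    """Herfindahl-Hirschman Index of a distribution: the sum of
    squared shares. 1.0 means one language dominates entirely;
    values near 1/N indicate N evenly used languages."""
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# Toy example: tweets per language observed in one region.
local = {"en": 80, "es": 20}                       # residents only
with_tourists = {"en": 60, "es": 15, "de": 15, "fr": 10}

hhi_local = herfindahl_hirschman(local)            # 0.8^2 + 0.2^2 = 0.68
hhi_mixed = herfindahl_hirschman(with_tourists)    # lower: more even mix
```

The difference-in-differences design compares how such concentration values shift when travel restrictions remove the non-local population from the digital corpus.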
arXiv Detail & Related papers (2021-04-03T02:09:37Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.