Comparing Measures of Linguistic Diversity Across Social Media Language
Data and Census Data at Subnational Geographic Areas
- URL: http://arxiv.org/abs/2308.10452v1
- Date: Mon, 21 Aug 2023 03:54:23 GMT
- Authors: Sidney G.-J. Wong, Jonathan Dunn and Benjamin Adams
- Abstract summary: This paper describes the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand.
We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes a preliminary study on the comparative linguistic
ecology of online spaces (i.e., social media language data) and real-world
spaces in Aotearoa New Zealand (i.e., subnational administrative areas). We
compare measures of linguistic diversity between these different spaces and
discuss how social media users align with real-world populations. The results
from the current study suggest that there is potential to use online social
media language data to observe spatial and temporal changes in linguistic
diversity at subnational geographic areas; however, further work is required to
understand how well social media represents real-world behaviour.
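
As a rough illustration of what comparing a diversity measure across the two kinds of space could look like, the sketch below computes the Herfindahl-Hirschman Index (the sum of squared language shares) for one area from two sources. The abstract does not say which measures the paper uses, so the choice of HHI here is an assumption, and the language codes and counts are invented for the example.

```python
from collections import Counter


def herfindahl_hirschman_index(counts):
    """Sum of squared shares: 1.0 for a single language, 1/k for k equally used languages."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    return sum((n / total) ** 2 for n in counts.values())


# Hypothetical counts per language for one subnational area (made-up numbers).
social_media = Counter({"eng": 80, "mri": 15, "zho": 5})
census = Counter({"eng": 70, "mri": 20, "zho": 10})

hhi_social = herfindahl_hirschman_index(social_media)
hhi_census = herfindahl_hirschman_index(census)

# A higher HHI means more concentration, i.e. lower diversity.
print(f"social media HHI: {hhi_social:.3f}")
print(f"census HHI:       {hhi_census:.3f}")
```

In this toy case the social media sample is more concentrated than the census sample, which is the kind of gap a comparison between online and real-world spaces would need to explain.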
Related papers
- Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future [59.78608958395464]
We build a Social AI Data Infrastructure, which consists of a comprehensive social AI taxonomy and a data library of 480 NLP datasets.
Our infrastructure allows us to analyze existing dataset efforts, and also evaluate language models' performance in different social intelligence aspects.
We show there is a need for multifaceted datasets, increased diversity in language and culture, more long-tailed social situations, and more interactive data in future social intelligence data efforts.
arXiv Detail & Related papers (2024-02-28T00:22:42Z)
- Global Voices, Local Biases: Socio-Cultural Prejudices across Languages [22.92083941222383]
Human biases are ubiquitous but not uniform; disparities exist across linguistic, cultural, and societal borders.
In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies.
To encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more.
arXiv Detail & Related papers (2023-10-26T17:07:50Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Evolving linguistic divergence on polarizing social media [0.0]
We quantify divergence in topics of conversation and word frequencies, messaging sentiment, and lexical semantics of words and emoji.
While US American English remains largely intelligible within its large speech community, our findings point at areas where miscommunication may arise.
arXiv Detail & Related papers (2023-09-04T15:21:55Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Language statistics at different spatial, temporal, and grammatical scales [48.7576911714538]
We use data from Twitter to explore the rank diversity at different scales.
The greatest changes come from variations in the grammatical scale.
As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales.
arXiv Detail & Related papers (2022-07-02T01:38:48Z)
- Geolocation differences of language use in urban areas [0.0]
We explore the use of Twitter data with precise geolocation information to resolve spatial variations in language use on an urban scale down to single city blocks.
Our work shows that analysis of small-scale variations can provide unique information on correlations between language use and social context.
arXiv Detail & Related papers (2021-08-01T19:55:45Z)
- Measuring Linguistic Diversity During COVID-19 [1.0312968200748118]
This paper calibrates measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic.
Previous work has mapped the distribution of languages using geo-referenced social media and web data.
This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias in digital corpora introduced by non-local populations.
arXiv Detail & Related papers (2021-04-03T02:09:37Z)
- Characterizing English Variation across Social Media Communities with BERT [9.98785450861229]
We analyze two months of English comments in 474 Reddit communities.
The specificity of different sense clusters to a community, combined with the specificity of a community's unique word types, is used to identify cases where a social group's language deviates from the norm.
We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
arXiv Detail & Related papers (2021-02-12T23:50:57Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- Experience Grounds Language [185.73483760454454]
Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates.
Despite the incredible effectiveness of language processing models in tackling tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world.
arXiv Detail & Related papers (2020-04-21T16:56:27Z)