Detecting Linguistic Diversity on Social Media
- URL: http://arxiv.org/abs/2502.21224v1
- Date: Fri, 28 Feb 2025 16:56:34 GMT
- Title: Detecting Linguistic Diversity on Social Media
- Authors: Sidney Wong, Benjamin Adams, Jonathan Dunn
- Abstract summary: We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We identify the language conditions of each tweet in the social media data set and validated our results with two language identification models. The results suggest that social media language data has the possibility to provide a rich source of spatial and temporal insights on the linguistic profile of a place.
- Score: 1.3108652488669732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This chapter explores the efficacy of using social media data to examine changing linguistic behaviour of a place. We focus our investigation on Aotearoa New Zealand where official statistics from the census is the only source of language use data. We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We use place as the common denominator between the two data sources. We identify the language conditions of each tweet in the social media data set and validated our results with two language identification models. We then compare levels of linguistic diversity at national, regional, and local geographies. The results suggest that social media language data has the possibility to provide a rich source of spatial and temporal insights on the linguistic profile of a place. We show that social media is sensitive to demographic and sociopolitical changes within a language and at low-level regional and local geographies.
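The comparison of linguistic diversity across places described in the abstract can be illustrated with a minimal sketch. The abstract does not say which diversity measure the authors use, so this example assumes Greenberg's diversity index (one minus the sum of squared language shares) purely for illustration. The function names, place names, and language labels below are hypothetical placeholders, and the labels are assumed to have already been assigned by a language identification model.

```python
from collections import Counter, defaultdict


def diversity_index(labels):
    """Greenberg-style diversity index: 1 - sum(p_i ** 2).

    Assumption: this measure is chosen for illustration only; the
    source does not state which measure the authors use.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def diversity_by_place(records):
    """Group (place, language) pairs by place and score each place."""
    by_place = defaultdict(list)
    for place, lang in records:
        by_place[place].append(lang)
    return {place: diversity_index(langs) for place, langs in by_place.items()}


# Hypothetical records: one (place, language label) pair per tweet.
tweets = [
    ("Auckland", "en"),
    ("Auckland", "mi"),
    ("Auckland", "en"),
    ("Auckland", "zh"),
    ("Canterbury", "en"),
    ("Canterbury", "en"),
]

scores = diversity_by_place(tweets)
for place, score in sorted(scores.items()):
    print(f"{place}: {score:.3f}")
```

Running the same function over census counts grouped by the same place names would give a second set of scores to compare against, since place is the common denominator between the two data sources.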
Related papers
- Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia [49.80565462746646]
We introduce the InfoGap method -- an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level.
We evaluate InfoGap by analyzing LGBT people's portrayals, across 2.7K biography pages on English, Russian, and French Wikipedias.
arXiv Detail & Related papers (2024-10-05T20:40:49Z)
- From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets [10.264294331399434]
Hate speech datasets have traditionally been developed by language.
We evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography.
We find that HS datasets for English, Arabic and Spanish exhibit a strong geo-cultural bias.
arXiv Detail & Related papers (2024-04-27T12:10:10Z)
- Global Voices, Local Biases: Socio-Cultural Prejudices across Languages [22.92083941222383]
Human biases are ubiquitous but not uniform; disparities exist across linguistic, cultural, and societal borders.
In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies.
To encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more.
arXiv Detail & Related papers (2023-10-26T17:07:50Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Multimodal Modeling For Spoken Language Identification [57.94119986116947]
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance.
We propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification.
arXiv Detail & Related papers (2023-09-19T12:21:39Z)
- Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas [1.0128808054306186]
This paper describes the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand.
We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations.
arXiv Detail & Related papers (2023-08-21T03:54:23Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Geolocation differences of language use in urban areas [0.0]
We explore the use of Twitter data with precise geolocation information to resolve spatial variations in language use on an urban scale down to single city blocks.
Our work shows that analysis of small-scale variations can provide unique information on correlations between language use and social context.
arXiv Detail & Related papers (2021-08-01T19:55:45Z)
- Words are the Window to the Soul: Language-based User Representations for Fake News Detection [5.876243339384605]
We introduce a model that creates representations of individuals on social media based only on the language they produce.
We show that language-based user representations are beneficial for this task.
arXiv Detail & Related papers (2020-11-14T21:14:17Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.