Measuring Linguistic Diversity During COVID-19
- URL: http://arxiv.org/abs/2104.01290v1
- Date: Sat, 3 Apr 2021 02:09:37 GMT
- Title: Measuring Linguistic Diversity During COVID-19
- Authors: Jonathan Dunn and Tom Coupe and Benjamin Adams
- Abstract summary: This paper calibrates measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic.
Previous work has mapped the distribution of languages using geo-referenced social media and web data.
This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias in digital corpora introduced by non-local populations.
- Score: 1.0312968200748118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computational measures of linguistic diversity help us understand the
linguistic landscape using digital language data. The contribution of this
paper is to calibrate measures of linguistic diversity using restrictions on
international travel resulting from the COVID-19 pandemic. Previous work has
mapped the distribution of languages using geo-referenced social media and web
data. The goal, however, has been to describe these corpora themselves rather
than to make inferences about underlying populations. This paper shows that a
difference-in-differences method based on the Herfindahl-Hirschman Index can
identify the bias in digital corpora that is introduced by non-local
populations. These methods tell us where significant changes have taken place
and whether this leads to increased or decreased diversity. This is an
important step in aligning digital corpora like social media with the
real-world populations that have produced them.
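The two building blocks named in the abstract are easy to state: the Herfindahl-Hirschman Index (HHI) is the sum of squared language shares in a corpus, and a difference-in-differences design compares the change in that index before and after the travel restrictions in one region against the same change in a comparison region. Below is a minimal sketch of that combination, assuming simple per-language post counts; the function names and the example numbers are hypothetical illustrations, not values or code from the paper.

```python
import numpy as np

def hhi(counts):
    """Herfindahl-Hirschman Index: the sum of squared language shares.
    Close to 1/n for n equally used languages (high diversity), 1.0 for a single language."""
    shares = np.asarray(counts, dtype=float)
    shares = shares / shares.sum()
    return float(np.sum(shares ** 2))

def did_hhi(pre_treated, post_treated, pre_control, post_control):
    """Difference-in-differences on HHI: the change in concentration in a region
    exposed to the travel restrictions, net of the change in a comparison region."""
    return (hhi(post_treated) - hhi(pre_treated)) - (hhi(post_control) - hhi(pre_control))

# Hypothetical per-language post counts for one region before and after border closures.
pre_counts  = [800, 150, 50]   # e.g. English, te reo Maori, Mandarin
post_counts = [900, 140, 10]   # travel-driven languages thin out once borders close
estimate = did_hhi(pre_counts, post_counts, [700, 200, 100], [710, 205, 95])
print(f"DiD estimate of the change in concentration: {estimate:+.4f}")
```

A positive estimate means concentration rose (diversity fell) in the treated region relative to the comparison region once travel stopped, which is the kind of shift the paper attributes to the removal of non-local populations.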
Related papers
- Variationist: Exploring Multifaceted Variation and Bias in Written Language Data [3.666781404469562]
Exploring and understanding language data is a fundamental stage in all areas dealing with human language.
Yet, there is currently a lack of a unified, customizable tool to seamlessly inspect and visualize language variation and bias.
In this paper, we introduce Variationist, a highly-modular, descriptive, and task-agnostic tool that fills this gap.
arXiv Detail & Related papers (2024-06-25T15:41:07Z)
- Global Voices, Local Biases: Socio-Cultural Prejudices across Languages [22.92083941222383]
Human biases are ubiquitous but not uniform; disparities exist across linguistic, cultural, and societal borders.
In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies.
To encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more.
arXiv Detail & Related papers (2023-10-26T17:07:50Z)
- Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception [28.716435050743957]
We study how people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli.
By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression.
Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.
arXiv Detail & Related papers (2023-10-22T16:51:42Z)
- Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas [1.0128808054306186]
This paper describes the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand.
We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations.
arXiv Detail & Related papers (2023-08-21T03:54:23Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
Model performance on all of these languages lags significantly behind English, with variations reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Stable Bias: Analyzing Societal Representations in Diffusion Models [72.27121528451528]
We propose a new method for exploring the social biases in Text-to-Image (TTI) systems.
Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts.
We leverage this method to analyze images generated by 3 popular TTI systems and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents.
arXiv Detail & Related papers (2023-03-20T19:32:49Z)
- Cross-Linguistic Syntactic Difference in Multilingual BERT: How Good is It and How Does It Affect Transfer? [50.48082721476612]
Multilingual BERT (mBERT) has demonstrated considerable cross-lingual syntactic ability.
We investigate the distributions of grammatical relations induced from mBERT in the context of 24 typologically different languages.
arXiv Detail & Related papers (2022-12-21T09:44:08Z)
- Language statistics at different spatial, temporal, and grammatical scales [48.7576911714538]
We use data from Twitter to explore the rank diversity at different scales.
The greatest changes come from variations in the grammatical scale.
As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales.
arXiv Detail & Related papers (2022-07-02T01:38:48Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z)