Mapping Language Literacy At Scale: A Case Study on Facebook
- URL: http://arxiv.org/abs/2303.12179v1
- Date: Tue, 21 Mar 2023 20:24:13 GMT
- Title: Mapping Language Literacy At Scale: A Case Study on Facebook
- Authors: Yu-Ru Lin and Shaomei Wu and Winter Mason
- Abstract summary: This work systematically studies the language literacy skills of online populations for more than 160 countries and regions across the world.
We develop a population-level literacy estimate for the online population based on aggregated and de-identified public posts written by adult Facebook users globally.
We found that, on Facebook, women collectively show higher language literacy than men in many countries, but substantial gaps remain in Africa and Asia.
- Score: 6.402634424631123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Literacy is one of the most fundamental skills for people to access and
navigate today's digital environment. This work systematically studies the
language literacy skills of online populations for more than 160 countries and
regions across the world, including many low-resourced countries where official
literacy data are particularly sparse. Leveraging public data on Facebook, we
develop a population-level literacy estimate for the online population that is
based on aggregated and de-identified public posts written by adult Facebook
users globally, significantly improving both the coverage and resolution of
existing literacy tracking data. We found that, on Facebook, women collectively
show higher language literacy than men in many countries, but substantial gaps
remain in Africa and Asia. Further, our analysis reveals a considerable
regional gap within a country that is associated with multiple socio-technical
inequalities, suggesting an "inequality paradox" -- where the online language
skill disparity interacts with offline socioeconomic inequalities in complex
ways. These findings have implications for global women's empowerment and
socioeconomic inequalities.
Related papers
- Artificial intelligence is creating a new global linguistic hierarchy [34.80252741178931]
We present a global longitudinal analysis of social, economic and infrastructural conditions across languages. We find that despite efforts to broaden the reach of language technologies, the dominance of a handful of languages is exacerbating disparities. We introduce the Language AI Readiness Index (EQUATE), which maps the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages.
arXiv Detail & Related papers (2026-02-12T14:50:44Z) - Do You Know About My Nation? Investigating Multilingual Language Models' Cultural Literacy Through Factual Knowledge [68.6805229085352]
Most multilingual question-answering benchmarks do not factor in regional diversity in the information they capture. XNationQA encompasses a total of 49,280 questions on the geography, culture, and history of nine countries, presented in seven languages. We benchmark eight standard multilingual LLMs on XNationQA and evaluate them using two novel transference metrics.
arXiv Detail & Related papers (2025-11-01T18:41:34Z) - Detecting Linguistic Diversity on Social Media [1.3108652488669732]
We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source.
We identify the language conditions of each tweet in the social media data set and validate our results with two language identification models.
The results suggest that social media language data has the potential to provide a rich source of spatial and temporal insights on the linguistic profile of a place.
arXiv Detail & Related papers (2025-02-28T16:56:34Z) - BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER -- a collection of multi-labeled datasets in 28 different languages.
We describe the data collection and annotation processes and the challenges of building these datasets.
We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z) - Bridging the Data Provenance Gap Across Text, Speech and Video [67.72097952282262]
We conduct the largest and first-of-its-kind longitudinal audit across modalities of popular text, speech, and video datasets.
Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.
We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets.
arXiv Detail & Related papers (2024-12-19T01:30:19Z) - The Call for Socially Aware Language Technologies [94.6762219597438]
We argue that many of these issues share a common core: a lack of awareness of the factors, context, and implications of the social environment in which NLP operates.
We argue that substantial challenges remain for NLP to develop social awareness and that we are just at the beginning of a new era for the field.
arXiv Detail & Related papers (2024-05-03T18:12:39Z) - Social Skill Training with Large Language Models [65.40795606463101]
People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life.
This perspective paper identifies social skill barriers to enter specialized fields.
We present a solution that leverages large language models for social skill training via a generic framework.
arXiv Detail & Related papers (2024-04-05T16:29:58Z) - Classist Tools: Social Class Correlates with Performance in NLP [27.683676116781758]
Sociodemographic characteristics are infrequently used in Natural Language Processing.
We show that NLP disadvantages less-privileged socioeconomic groups.
We argue for the inclusion of socioeconomic class in future language technologies.
arXiv Detail & Related papers (2024-03-07T12:27:08Z) - Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future [59.78608958395464]
We build a Social AI Data Infrastructure, which consists of a comprehensive social AI taxonomy and a data library of 480 NLP datasets.
Our infrastructure allows us to analyze existing dataset efforts, and also evaluate language models' performance in different social intelligence aspects.
We show there is a need for multifaceted datasets, increased diversity in language and culture, more long-tailed social situations, and more interactive data in future social intelligence data efforts.
arXiv Detail & Related papers (2024-02-28T00:22:42Z) - Global Voices, Local Biases: Socio-Cultural Prejudices across Languages [22.92083941222383]
Human biases are ubiquitous but not uniform; disparities exist across linguistic, cultural, and societal borders.
In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies.
To encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more.
arXiv Detail & Related papers (2023-10-26T17:07:50Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas [1.0128808054306186]
This paper describes the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand.
We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations.
arXiv Detail & Related papers (2023-08-21T03:54:23Z) - Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World [2.0777058026628583]
Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently.
This paper provides a comprehensive analysis of the disparity that exists within the languages of the world.
arXiv Detail & Related papers (2022-10-16T12:50:30Z) - Mapping the Multilingual Margins: Intersectional Biases of Sentiment Analysis Systems in English, Spanish, and Arabic [3.3458760961317635]
We introduce four multilingual Equity Evaluation Corpora, supplementary test sets designed to measure social biases, and a novel statistical framework for studying unisectional and intersectional social biases in natural language processing.
We use these tools to measure gender, racial, ethnic, and intersectional social biases across five models trained on emotion regression tasks in English, Spanish, and Arabic.
arXiv Detail & Related papers (2022-04-07T16:33:15Z) - Towards Understanding and Mitigating Social Biases in Language Models [107.82654101403264]
Large-scale pretrained language models (LMs) can be potentially dangerous in manifesting undesirable representational biases.
We propose steps towards mitigating social biases during text generation.
Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information.
arXiv Detail & Related papers (2021-06-24T17:52:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.