Some Languages are More Equal than Others: Probing Deeper into the
Linguistic Disparity in the NLP World
- URL: http://arxiv.org/abs/2210.08523v2
- Date: Thu, 20 Oct 2022 00:24:16 GMT
- Title: Some Languages are More Equal than Others: Probing Deeper into the
Linguistic Disparity in the NLP World
- Authors: Surangika Ranathunga and Nisansa de Silva
- Abstract summary: Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently.
This paper provides a comprehensive analysis of the disparity that exists within the languages of the world.
- Score: 2.0777058026628583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linguistic disparity in the NLP world is a problem that has been widely
acknowledged recently. However, different facets of this problem, and the
reasons behind the disparity, are seldom discussed within the NLP community.
This paper provides a comprehensive analysis of the disparity that exists
within the languages of the world. We show that simply categorising languages
by data availability may not always be correct. Using an existing language
categorisation based on speaker population and vitality, we analyse the
distribution of language data resources, the amount of NLP/CL research,
inclusion in multilingual web-based platforms, and inclusion in pre-trained
multilingual models. We show that many languages are not covered by these
resources or platforms, and that there is wide disparity even among languages
belonging to the same group. We analyse the impact of language family,
geographical location, GDP and speaker population, and provide possible
reasons for this disparity, along with suggestions for overcoming it.
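As a rough illustration of the kind of analysis described above, the Python sketch below groups languages by a resource-taxonomy class and checks how resource counts track speaker population and GDP. All class labels, figures, and column names are invented for illustration; they are not data from the paper.

```python
# Hypothetical sketch: quantifying resource disparity across language classes.
# All figures below are invented placeholders, not the paper's data.
import pandas as pd
from scipy.stats import spearmanr

# Toy metadata: taxonomy class (0 = least-resourced .. 5 = most-resourced),
# speakers (millions), GDP of speaker regions (billions USD), dataset counts.
langs = pd.DataFrame(
    [
        ("English", 5, 1452.0, 45000, 950, 12000),
        ("Tamil",   3,   86.0,   320,  60,   700),
        ("Swahili", 2,   87.0,   240,  30,   310),
        ("Sinhala", 1,   17.0,    85,  12,   140),
        ("Dhivehi", 0,    0.5,     6,   1,     9),
    ],
    columns=["language", "class", "speakers_m", "gdp_b",
             "labelled_sets", "unlabelled_sets"],
)

# Resource distribution per class: even a toy table shows the skew.
print(langs.groupby("class")[["labelled_sets", "unlabelled_sets"]].sum())

# Rank correlation between socio-economic factors and labelled resources.
for factor in ("speakers_m", "gdp_b"):
    rho, p = spearmanr(langs[factor], langs["labelled_sets"])
    print(f"{factor} vs labelled_sets: rho={rho:.2f} (p={p:.3f})")
```

With only five rows the p-values are meaningless; the point is the shape of the analysis, not the numbers.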
Related papers
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- What is "Typological Diversity" in NLP? [7.58293347591642]
We introduce metrics to approximate the diversity of language selection along several axes.
We show that skewed language selection can lead to overestimated multilingual performance (a toy diversity metric in this spirit is sketched after this list).
arXiv Detail & Related papers (2024-02-06T18:29:39Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content (a simple lexical-diversity proxy is illustrated after this list).
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Language Chameleon: Transformation analysis between languages using Cross-lingual Post-training based on Pre-trained language models [4.731313022026271]
In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT).
Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitude more data, but is also highly efficient in the transfer process.
arXiv Detail & Related papers (2022-09-14T05:20:52Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group a representation sprachbund (see the clustering sketch following this list).
Experiments are conducted on cross-lingual benchmarks, and significant improvements are achieved over strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways of quantifying bias in multilingual representations (an association-based bias measure is sketched after this list).
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- The State and Fate of Linguistic Diversity and Inclusion in the NLP World [12.936270946393483]
Language technologies contribute to promoting multilingualism and linguistic diversity around the world.
However, only a very small number of the over 7,000 languages of the world are represented in the rapidly evolving language technologies and applications.
arXiv Detail & Related papers (2020-04-20T07:19:22Z)
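For the "Typological Diversity" entry above, here is a minimal sketch of one axis such a metric could measure: normalised entropy over the language families in a sample. The family labels and the choice of entropy are my assumptions for illustration, not the paper's actual metrics.

```python
# Hypothetical sketch: normalised entropy over language families as one
# axis of selection diversity. Family labels are illustrative only.
import math
from collections import Counter

def family_entropy(families: list[str]) -> float:
    """Normalised Shannon entropy of the family distribution, in [0, 1]."""
    counts = Counter(families)
    probs = [c / len(families) for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / h_max

skewed = ["Indo-European"] * 8 + ["Sino-Tibetan", "Afro-Asiatic"]
balanced = ["Indo-European", "Sino-Tibetan", "Afro-Asiatic",
            "Niger-Congo", "Austronesian"] * 2
print(family_entropy(skewed))    # low: the selection is concentrated
print(family_entropy(balanced))  # 1.0: families are evenly covered
```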
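For the NusaWrites entry, a simple stand-in for the lexical diversity it reports is the type-token ratio (TTR); whether the authors used this exact measure is an assumption here.

```python
# Hypothetical sketch: type-token ratio as a crude lexical-diversity proxy.
# The example strings are invented.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

scraped = "the cat sat the cat sat the cat sat on the mat"
written = "a curious cat perched quietly on the woven rattan mat"
print(f"scraped: {type_token_ratio(scraped):.2f}")
print(f"written: {type_token_ratio(written):.2f}")
```

Raw TTR is sensitive to text length, so length-normalised variants are preferable for real corpus comparisons.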
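For the Representation Sprachbund entry, the sketch below clusters per-language vectors with k-means. The random vectors stand in for representations extracted from a multilingual pre-trained model, and scikit-learn's KMeans is my choice of clustering method, not necessarily the authors'.

```python
# Hypothetical sketch: grouping languages by clustering their representations.
# Random vectors are placeholders for real multilingual-model embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
languages = ["en", "de", "fr", "es", "zh", "ja", "ko", "hi", "ta", "sw"]
embeddings = rng.normal(size=(len(languages), 768))  # placeholder vectors

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)
for cluster_id in range(3):
    members = [l for l, c in zip(languages, kmeans.labels_) if c == cluster_id]
    print(f"representation sprachbund {cluster_id}: {members}")
```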
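For the Gender Bias entry, one common family of measures compares a target word's association with two attribute sets, in the spirit of WEAT (Caliskan et al., 2017); treating this as the paper's measure is an assumption.

```python
# Hypothetical sketch: a WEAT-style association score on toy vectors.
# Real use would load multilingual embeddings and translated word lists.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(w: np.ndarray, A: list, B: list) -> float:
    """Mean cosine similarity of w to attribute set A minus set B."""
    return (np.mean([cosine(w, a) for a in A])
            - np.mean([cosine(w, b) for b in B]))

rng = np.random.default_rng(1)
vec = {w: rng.normal(size=50)
       for w in ["engineer", "nurse", "he", "him", "she", "her"]}
male, female = [vec["he"], vec["him"]], [vec["she"], vec["her"]]
for target in ("engineer", "nurse"):
    print(target, round(association(vec[target], male, female), 3))
```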
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.