The State and Fate of Linguistic Diversity and Inclusion in the NLP
World
- URL: http://arxiv.org/abs/2004.09095v3
- Date: Wed, 27 Jan 2021 03:39:20 GMT
- Title: The State and Fate of Linguistic Diversity and Inclusion in the NLP
World
- Authors: Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit
Choudhury
- Abstract summary: Language technologies contribute to promoting multilingualism and linguistic diversity around the world.
Only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications.
- Score: 12.936270946393483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language technologies contribute to promoting multilingualism and linguistic
diversity around the world. However, only a very small number of the over 7000
languages of the world are represented in the rapidly evolving language
technologies and applications. In this paper we look at the relation between
the types of languages, resources, and their representation in NLP conferences
to understand the trajectory that different languages have followed over time.
Our quantitative investigation underlines the disparity between languages,
especially in terms of their resources, and calls into question the "language
agnostic" status of current models and systems. Through this paper, we attempt
to convince the ACL community to prioritise the resolution of the predicaments
highlighted here, so that no language is left behind.
Related papers
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhancing the multilingual capabilities of large language models (LLMs).
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs.
It achieves superior results with much fewer computational resources compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
- Natural Language Processing RELIES on Linguistics [13.142686158720021]
We propose the acronym RELIES, which encapsulates six major facets where linguistics contributes to NLP.
This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes.
arXiv Detail & Related papers (2024-05-09T17:59:32Z)
- Multilingual Text Representation [3.4447129363520337]
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages.
State-of-the-art language models have come a long way, starting from simple one-hot representations of words.
We discuss how the full potential of language democratization could be obtained, reaching beyond the known limits.
arXiv Detail & Related papers (2023-09-02T14:21:22Z)
- Towards Bridging the Digital Language Divide [4.234367850767171]
Multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden, representational preference towards certain languages.
We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented.
We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z)
- GlobalBench: A Benchmark for Global Progress in Natural Language Processing [114.24519009839142]
GlobalBench aims to track progress on all NLP datasets in all languages.
It tracks the estimated per-speaker utility and equity of language technology across all languages.
Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
arXiv Detail & Related papers (2023-05-24T04:36:32Z)
- Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World [2.0777058026628583]
Linguistic disparity in the NLP world is a problem that has recently been widely acknowledged.
This paper provides a comprehensive analysis of the disparity that exists within the languages of the world.
arXiv Detail & Related papers (2022-10-16T12:50:30Z)
- One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)
- Systematic Inequalities in Language Technology Performance across the World's Languages [94.65681336393425]
We introduce a framework for estimating the global utility of language technologies.
Our analyses involve the field at large, but also more in-depth studies on both user-facing technologies and more linguistic NLP tasks.
arXiv Detail & Related papers (2021-10-13T14:03:07Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis reveals that even phones unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.