Computational historical linguistics and language diversity in South
Asia
- URL: http://arxiv.org/abs/2203.12524v1
- Date: Wed, 23 Mar 2022 16:36:24 GMT
- Title: Computational historical linguistics and language diversity in South
Asia
- Authors: Aryaman Arora, Adam Farris, Samopriya Basu, Suresh Kolichala
- Abstract summary: South Asia is home to a plethora of languages, many of which severely lack access to new language technologies.
This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics.
We claim that data scatteredness is the primary obstacle in the development of South Asian language technology.
- Score: 1.5293427903448025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: South Asia is home to a plethora of languages, many of which severely lack
access to new language technologies. This linguistic diversity also results in
a research environment conducive to the study of comparative, contact, and
historical linguistics -- fields which necessitate the gathering of extensive
data from many languages. We claim that data scatteredness (rather than
scarcity) is the primary obstacle in the development of South Asian language
technology, and suggest that the study of language history is uniquely aligned
with surmounting this obstacle. We review recent developments in and at the
intersection of South Asian NLP and historical-comparative linguistics,
describing our and others' current efforts in this area. We also offer new
strategies towards breaking the data barrier.
Related papers
- Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation [38.81102126876936]
This paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms.
To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers.
Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B, highlights the significant challenges these models face when translating into low-resource languages.
arXiv Detail & Related papers (2024-11-18T05:41:27Z) - Socially Responsible Data for Large Multilingual Language Models [12.338723881042926]
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years.
Various efforts are striving for models to accommodate languages of communities outside of the Global North.
arXiv Detail & Related papers (2024-09-08T23:51:04Z) - Recent Advancements and Challenges of Turkic Central Asian Language Processing [4.189204855014775]
Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
arXiv Detail & Related papers (2024-07-06T08:58:26Z) - A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.314619377988436]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing.
Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient.
This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z) - The Ghanaian NLP Landscape: A First Look [9.17372840572907]
Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk.
This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages.
arXiv Detail & Related papers (2024-05-10T21:39:09Z) - CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts [50.44270798959864]
Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages.
We study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language.
arXiv Detail & Related papers (2024-04-19T04:02:50Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z) - Cross-Cultural Transfer Learning for Chinese Offensive Language
Detection [9.341003339029221]
We aim to investigate the impact of transfer learning using offensive language detection data from different cultural backgrounds.
We find that culture-specific biases in what is considered offensive negatively impact the transferability of language models.
In a few-shot learning scenario, however, our study shows promising prospects for non-English offensive language detection with limited resources.
arXiv Detail & Related papers (2023-03-31T09:50:07Z) - Experience Grounds Language [185.73483760454454]
Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates.
Despite the incredible effectiveness of language processing models to tackle tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world.
arXiv Detail & Related papers (2020-04-21T16:56:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.