A Survey of Corpora for Germanic Low-Resource Languages and Dialects
- URL: http://arxiv.org/abs/2304.09805v1
- Date: Wed, 19 Apr 2023 16:45:16 GMT
- Title: A Survey of Corpora for Germanic Low-Resource Languages and Dialects
- Authors: Verena Blaschke, Hinrich Schütze, Barbara Plank
- Abstract summary: This work focuses on low-resource languages and in particular non-standardized low-resource languages.
We make our overview of over 80 corpora publicly available to facilitate research.
- Score: 18.210880703295253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite much progress in recent years, the vast majority of work in natural
language processing (NLP) is on standard languages with many speakers. In this
work, we instead focus on low-resource languages and in particular
non-standardized low-resource languages. Even within branches of major language
families, often considered well-researched, little is known about the extent
and type of available resources and what the major NLP challenges are for these
language varieties. The first step to address this situation is a systematic
survey of available corpora (most importantly, annotated corpora, which are
particularly valuable for NLP research). Focusing on Germanic low-resource
language varieties, we provide such a survey in this paper. Except for
geolocation (origin of speaker or document), we find that manually annotated
linguistic resources are sparse and, if they exist, mostly cover morphosyntax.
Despite this lack of resources, we observe that interest in this area is
increasing: there is active development and a growing research community. To
facilitate research, we make our overview of over 80 corpora publicly
available. We share a companion website for this overview at
https://github.com/mainlp/germanic-lrl-corpora.
Related papers
- The Zeno's Paradox of 'Low-Resource' Languages [20.559416975723142]
We show how several interacting axes contribute to the 'low-resourcedness' of a language.
We hope our work elicits explicit definitions of the terminology when it is used in papers.
arXiv Detail & Related papers (2024-10-28T08:05:34Z)
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
- LLMs Are Few-Shot In-Context Low-Resource Language Learners [59.74451570590808]
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages.
We extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages.
Our study confirms the importance of few-shot in-context information for improving LLMs' understanding of low-resource languages.
arXiv Detail & Related papers (2024-03-25T07:55:29Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Contextualising Levels of Language Resourcedness affecting Digital Processing of Text [0.5620321106679633]
We argue that the dichotomous LRL/HRL typology for all languages is problematic.
Our characterization is based on a typology of contextual features for each category, rather than on counting tools.
arXiv Detail & Related papers (2023-09-29T07:48:24Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
- How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages [1.7625363344837164]
We introduce the first cross-lingual information extraction pipeline for Sumerian.
We also curate InterpretLR, an interpretability toolkit for low-resource NLP.
Most components of our pipeline can be generalised to other languages to obtain an interpretable execution.
arXiv Detail & Related papers (2021-05-30T12:09:59Z)
- When Word Embeddings Become Endangered [0.685316573653194]
We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
arXiv Detail & Related papers (2021-03-24T15:42:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.