A Survey of Corpora for Germanic Low-Resource Languages and Dialects
- URL: http://arxiv.org/abs/2304.09805v1
- Date: Wed, 19 Apr 2023 16:45:16 GMT
- Title: A Survey of Corpora for Germanic Low-Resource Languages and Dialects
- Authors: Verena Blaschke, Hinrich Schütze, Barbara Plank
- Abstract summary: This work focuses on low-resource languages and in particular non-standardized low-resource languages.
We make our overview of over 80 corpora publicly available to facilitate research.
- Score: 18.210880703295253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite much progress in recent years, the vast majority of work in natural
language processing (NLP) is on standard languages with many speakers. In this
work, we instead focus on low-resource languages and in particular
non-standardized low-resource languages. Even within branches of major language
families, often considered well-researched, little is known about the extent
and type of available resources and what the major NLP challenges are for these
language varieties. The first step to address this situation is a systematic
survey of available corpora (most importantly, annotated corpora, which are
particularly valuable for NLP research). Focusing on Germanic low-resource
language varieties, we provide such a survey in this paper. Except for
geolocation (origin of speaker or document), we find that manually annotated
linguistic resources are sparse and, if they exist, mostly cover morphosyntax.
Despite this lack of resources, we observe that interest in this area is
increasing: there is active development and a growing research community. To
facilitate research, we make our overview of over 80 corpora publicly
available. We share a companion website of this overview at
https://github.com/mainlp/germanic-lrl-corpora .
Related papers
- KyrgyzNLP: Challenges, Progress, and Future [1.1920184024241331]
Large language models (LLMs) have excelled in numerous benchmarks, advancing AI applications in both linguistic and non-linguistic tasks.
This has primarily benefited well-resourced languages, leaving less-resourced ones (LRLs) at a disadvantage.
In this paper, we highlight the current state of NLP for one specific LRL: Kyrgyz (kyrgyz tili).
arXiv Detail & Related papers (2024-11-08T12:03:31Z)
- The Zeno's Paradox of 'Low-Resource' Languages [20.559416975723142]
We show how several interacting axes contribute to the 'low-resourcedness' of a language.
We hope our work elicits explicit definitions of the terminology when it is used in papers.
arXiv Detail & Related papers (2024-10-28T08:05:34Z)
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
- LLMs Are Few-Shot In-Context Low-Resource Language Learners [59.74451570590808]
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages.
We extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages.
Our study underscores the significance of few-shot in-context information in enhancing the low-resource understanding quality of LLMs.
arXiv Detail & Related papers (2024-03-25T07:55:29Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Contextualising Levels of Language Resourcedness affecting Digital Processing of Text [0.5620321106679633]
We argue that a dichotomous typology of LRL and HRL for all languages is problematic.
The characterization is based on the typology of contextual features for each category, rather than counting tools.
arXiv Detail & Related papers (2023-09-29T07:48:24Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Toward More Meaningful Resources for Lower-resourced Languages [2.3513645401551333]
We examine the contents of the names stored in Wikidata for a few lower-resourced languages.
We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data.
We conclude with recommended guidelines for resource development.
arXiv Detail & Related papers (2022-02-24T18:39:57Z)
- When Word Embeddings Become Endangered [0.685316573653194]
We present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and translation dictionaries of resource-poor languages.
All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
arXiv Detail & Related papers (2021-03-24T15:42:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.