LuxBank: The First Universal Dependency Treebank for Luxembourgish
- URL: http://arxiv.org/abs/2411.04813v1
- Date: Thu, 07 Nov 2024 15:50:40 GMT
- Title: LuxBank: The First Universal Dependency Treebank for Luxembourgish
- Authors: Alistair Plum, Caroline Döhmer, Emilia Milano, Anne-Marie Lutgen, Christoph Purschke
- Abstract summary: Luxembourgish is a West Germanic language spoken by approximately 400,000 people.
We introduce LuxBank, the first Universal Dependencies (UD) Treebank for Luxembourgish.
- Score: 0.38447712214412116
- Abstract: The Universal Dependencies (UD) project has significantly expanded linguistic coverage across 161 languages, yet Luxembourgish, a West Germanic language spoken by approximately 400,000 people, has remained absent until now. In this paper, we introduce LuxBank, the first UD Treebank for Luxembourgish, addressing the gap in syntactic annotation and analysis for this 'low-research' language. We establish formal guidelines for Luxembourgish language annotation, providing the foundation for the first large-scale quantitative analysis of its syntax. LuxBank serves not only as a resource for linguists and language learners but also as a tool for developing spell checkers and grammar checkers, organising existing text archives and even training large language models. By incorporating Luxembourgish into the UD framework, we aim to enhance the understanding of syntactic variation within West Germanic languages and offer a model for documenting smaller, semi-standardised languages. This work positions Luxembourgish as a valuable resource in the broader linguistic and NLP communities, contributing to the study of languages with limited research and resources.
Related papers
- The Zeno's Paradox of 'Low-Resource' Languages [20.559416975723142]
We show how several interacting axes contribute to the 'low-resourcedness' of a language.
We hope our work elicits explicit definitions of the terminology when it is used in papers.
arXiv Detail & Related papers (2024-10-28T08:05:34Z)
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Spanish Resource Grammar version 2023 [12.009437358109407]
We present the latest version of the Spanish Resource Grammar (SRG).
Such grammars encode a complex set of hypotheses about syntax making them a resource for empirical testing of linguistic theory.
This version of the SRG uses the recent version of the Freeling morphological analyzer and is released along with an automatically created, manually verified treebank of 2,291 sentences.
arXiv Detail & Related papers (2023-09-23T09:24:05Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Low-resource Bilingual Dialect Lexicon Induction with Large Language Models [24.080565202390314]
We present an analysis of the bilingual lexicon induction pipeline for German and two of its dialects, Bavarian and Alemannic.
This setup poses several challenges, including the scarcity of resources, the relatedness of the languages, and the lack of standardization in the orthography of dialects.
arXiv Detail & Related papers (2023-04-19T20:20:41Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- RuCoLA: Russian Corpus of Linguistic Acceptability [6.500438378175089]
We introduce the Russian Corpus of Linguistic Acceptability (RuCoLA).
RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models.
We demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors.
arXiv Detail & Related papers (2022-10-23T18:29:22Z)
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building bilingual word embeddings (BWEs) in which the vector space of the high-resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)