MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank
- URL: http://arxiv.org/abs/2403.10293v1
- Date: Fri, 15 Mar 2024 13:33:10 GMT
- Title: MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank
- Authors: Verena Blaschke, Barbara Kovačić, Siyao Peng, Hinrich Schütze, Barbara Plank
- Abstract summary: We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in "within-language breadth": most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.
Related papers
- Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data (2024-03-19T14:12:54Z)
This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet).
The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information.
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages (2024-03-16T20:18:36Z)
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling (2024-03-15T21:21:11Z)
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
- Low-resource Bilingual Dialect Lexicon Induction with Large Language Models (2023-04-19T20:20:41Z)
We present an analysis of the bilingual lexicon induction pipeline for German and two of its dialects, Bavarian and Alemannic.
This setup poses several challenges, including the scarcity of resources, the relatedness of the languages, and the lack of standardization in the orthography of dialects.
- Sememe Prediction for BabelNet Synsets using Multilingual and Multimodal Information (2022-03-14T18:37:09Z)
Sememe knowledge bases (KBs) are built by manually annotating words with sememes.
Existing sememe KBs only cover a few languages, which hinders the wide utilization of sememes.
This paper aims to build a multilingual sememe KB based on BabelNet, a multilingual encyclopedia dictionary.
- SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German (2021-03-21T14:00:09Z)
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German.
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization (2020-10-20T05:41:35Z)
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
- Intrinsic Probing through Dimension Selection (2020-10-06T15:21:08Z)
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
- Prague Dependency Treebank -- Consolidated 1.0 (2020-06-05T20:52:55Z)
Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0)
PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme.
Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation.