Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect
- URL: http://arxiv.org/abs/2602.16852v2
- Date: Wed, 25 Feb 2026 16:49:29 GMT
- Title: Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect
- Authors: Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense
- Abstract summary: Meenzerisch is the dialect spoken in the German city of Mainz. Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. This work presents the first research in the field of NLP that is explicitly focused on Meenzerisch.
- Score: 17.504351782064113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.
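As a rough illustration of the word-generation setup described in the abstract, the Python sketch below builds a few-shot prompt from (word, meaning) pairs and scores predictions by exact match. The abstract does not say how accuracy is computed or how prompts are worded, so the normalization, the prompt template, and the function names `normalize`, `exact_match_accuracy`, and `build_few_shot_prompt` are all assumptions, and the entries are placeholders rather than real dictionary data.

```python
import unicodedata
from typing import Sequence, Tuple


def normalize(word: str) -> str:
    """Apply NFC, trim whitespace, and casefold so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", word).strip().casefold()


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after normalization.

    Assumption: exact match is used here only for illustration; the paper's
    actual accuracy criterion is not stated in the abstract.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def build_few_shot_prompt(examples: Sequence[Tuple[str, str]], definition: str) -> str:
    """Build a prompt from (dialect word, Standard German meaning) example pairs.

    The template is hypothetical, not the prompt used by the authors.
    """
    parts = ["Give the Meenzerisch word that matches the Standard German definition."]
    for word, meaning in examples:
        parts.append(f"Definition: {meaning}\nWord: {word}")
    # The final block leaves the word open for the model to complete.
    parts.append(f"Definition: {definition}\nWord:")
    return "\n\n".join(parts)


# Placeholder entries standing in for dictionary rows (not real Meenzerisch words).
train_examples = [("WORT_A", "BEDEUTUNG_A"), ("WORT_B", "BEDEUTUNG_B")]
prompt = build_few_shot_prompt(train_examples, "BEDEUTUNG_C")
accuracy = exact_match_accuracy(["wort_c ", "WORT_X"], ["WORT_C", "WORT_D"])
```

In this sketch the few-shot examples would come from a training split of the dictionary, while accuracy would be computed on held-out entries only.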
Related papers
- Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries [0.0]
This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs). The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset.
arXiv Detail & Related papers (2025-10-09T08:29:22Z)
- Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora [38.54622638611305]
We use Bavarian as a case study and investigate the lexical dialect understanding capability of Large Language Models (LLMs). We use DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma.
arXiv Detail & Related papers (2025-09-22T14:49:08Z)
- Are Lexicon-Based Tools Still the Gold Standard for Valence Analysis in Low-Resource Flemish? [0.0]
Traditional lexicon-based tools such as LIWC and Pattern have long served as foundational instruments in this domain. We first conducted a study involving approximately 25,000 textual responses from 102 Dutch-speaking participants. We assessed the performance of three Dutch-specific LLMs in predicting these valence scores, and compared their outputs to those generated by LIWC and Pattern. This study underscores the imperative for developing culturally and linguistically tailored models/tools that can adeptly handle the complexities of natural language use.
arXiv Detail & Related papers (2025-06-04T16:31:37Z)
- Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects [0.0]
We aim to pioneer NLP technologies for Comorian, a group of four languages or dialects belonging to the Bantu family. Our approach is motivated by the hypothesis that if a human can understand a different language from their native language with little or no effort, it would be entirely possible to model this process on a machine.
arXiv Detail & Related papers (2024-12-09T22:47:41Z)
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called Dictionary Insertion Prompting (DIP).
When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs.
It then enables better translation into English and better English model thinking steps which leads to obviously better results.
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
- Are BabyLMs Second Language Learners? [48.85680614529188]
This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge.
Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective.
arXiv Detail & Related papers (2024-10-28T17:52:15Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German [22.30271453485001]
We introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference.
Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German.
arXiv Detail & Related papers (2021-03-21T14:00:09Z)
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)