Fugu-MT Paper Translation (Summary): Resource-Lean Lexicon Induction for German Dialects

Paper summary: Resource-Lean Lexicon Induction for German Dialects

arxiv url: http://arxiv.org/abs/2604.23824v1
Date: Sun, 26 Apr 2026 18:09:56 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.575953
Title: Resource-Lean Lexicon Induction for German Dialects
Title (reference translation): Resource-Lean Lexicon Induction for German Dialects
Authors: Robert Litschko, Barbara Plank, Diego Frassinelli
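Abstract summary: We show that statistical models trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform large language models, enable cross-dialect transfer, and offer a lightweight data-driven alternative. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects.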
Reference score (independently computed attention score): 42.23792930877588
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.
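Abstract (reference translation): Automatically inducing high-quality dictionaries is essential for building lexical resources, but low-resource languages and dialects pose several challenges: limited access to annotators, a high degree of spelling variation, and poor performance of large language models (LLMs). We show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate the models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we investigate how well models transfer across different German dialects and how they perform with varying amounts of training data.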
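The abstract names string similarity features as the input to the random forests but does not list them. The following is a minimal sketch of what such features could look like for a (dialect word, Standard German word) candidate pair. The feature set, the function names, and the example words are assumptions chosen for illustration, not the authors' actual configuration. In a full pipeline, feature vectors like these could be passed to a random forest classifier (for example scikit-learn's `RandomForestClassifier`) that scores candidate translation pairs.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_ngrams(word: str, n: int = 2) -> set:
    """Character n-grams of a word padded with '#' boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity_features(dialect_word: str, german_word: str) -> dict:
    """Hypothetical feature set for one candidate pair (illustrative only)."""
    a, b = dialect_word.lower(), german_word.lower()
    longest = max(len(a), len(b)) or 1  # avoid division by zero
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return {
        "norm_edit_sim": 1 - levenshtein(a, b) / longest,
        "bigram_jaccard": len(grams_a & grams_b) / len(union) if union else 0.0,
        "matcher_ratio": SequenceMatcher(None, a, b).ratio(),
        "len_diff": abs(len(a) - len(b)),
    }


if __name__ == "__main__":
    # Illustrative pair: the dialect form "net" against "nicht" and "haus".
    print(similarity_features("net", "nicht"))
    print(similarity_features("net", "haus"))
```

Because the features depend only on surface strings, they need no pretrained embeddings or GPU, which fits the resource-lean framing in the abstract. Whether these particular features match the ones used in the paper is not stated in the abstract.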


Related papers