Fugu-MT paper translation (abstract): Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

Paper overview: Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

arxiv url: http://arxiv.org/abs/2606.09767v1
Date: Mon, 08 Jun 2026 17:29:08 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:07.593551
Title: Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan
Title (reference translation): Data synthesis and parameter-efficient fine-tuning for low-resource NMT: a case study on Q'eqchi' Mayan
Authors: Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee
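Abstract summary: This study proposes a data synthesis methodology for bootstrapping NMT models. We transform community-sourced dictionaries into a large-scale synthetic corpus and apply Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. Evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59): the model maintains grammatical integrity but lacks the lexical grounding of natural language.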
Reference score (attention score computed independently by the site): 42.654087108357594
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.
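Abstract (reference translation): Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, which leads to reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology for bootstrapping NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transform community-sourced dictionaries into a massive synthetic corpus and apply Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation shows high structural acquisition (BLEU 42.02), demonstrating that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59): the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model overfits to the constrained structural variance of the synthetic templates. Despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language and forces organic inputs into rigid learned patterns. Furthermore, an ablation study using a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for the limited parameter capacity of the LoRA adapters and caused over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but that it requires authentic data for semantic refinement via Curriculum Learning.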
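
The abstract describes expanding dictionary entries into a synthetic parallel corpus through templates. The following is a minimal sketch of that general idea, not the authors' pipeline. Everything in it is an assumption made for illustration: the function name `synthesize`, the English source side, the single three-slot template, and the placeholder target tokens (`TGT_...`), which are dummy strings and not real Q'eqchi' words. It models only the VOS word order named in the abstract and ignores agglutinative morphology entirely.

```python
import itertools
import random

# Hypothetical placeholder lexicon: source gloss -> target form.
# The target strings are dummy tokens, NOT real Q'eqchi' vocabulary.
NOUNS = {"dog": "TGT_dog", "house": "TGT_house"}
VERBS = {"sees": "TGT_sees", "finds": "TGT_finds"}


def synthesize(nouns, verbs, seed=0):
    """Expand dictionary entries into (source, target) sentence pairs.

    The source side follows a subject-verb-object template; the target
    side follows a verb-object-subject (VOS) template.
    """
    pairs = []
    for (s_src, s_tgt), (v_src, v_tgt), (o_src, o_tgt) in itertools.product(
        nouns.items(), verbs.items(), nouns.items()
    ):
        if s_src == o_src:
            continue  # skip pairs where subject and object are the same noun
        source = f"the {s_src} {v_src} the {o_src}"
        target = f"{v_tgt} {o_tgt} {s_tgt}"  # verb, then object, then subject
        pairs.append((source, target))
    # Shuffle deterministically so repeated runs give the same ordering.
    random.Random(seed).shuffle(pairs)
    return pairs


if __name__ == "__main__":
    for src, tgt in synthesize(NOUNS, VERBS):
        print(src, "->", tgt)
```

In this toy setup every target sentence has the same three-slot shape, whatever the vocabulary. That may help illustrate the abstract's observation that lexical variety in the pipeline does not by itself add structural variety to the synthetic corpus.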


Related papers