Fugu-MT Paper Translation (Abstract): ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

Paper overview: ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

arxiv url: http://arxiv.org/abs/2603.10211v1
Date: Tue, 10 Mar 2026 20:24:38 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.677936
Title: ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation
Title (reference translation): ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation
Authors: Khoa Anh Ta, Nguyen Van Dinh, Kiet Van Nguyen
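Abstract summary: ViDia2Std is the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation. Unlike earlier datasets, ViDia2Std includes diverse dialects from the Central, Southern, and non-standard Northern regions. The reported annotator agreement rates are 86% (North), 82% (Central), and 85% (South).
Reference score (independently computed attention score): 5.3220011447194215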
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.
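Abstract (reference translation): Vietnamese exhibits extensive dialectal variation, which poses challenges for NLP systems trained mainly on standard Vietnamese. Such systems often perform worse on dialectal input, especially from the underrepresented Central and Southern regions. Prior work on dialect normalization has focused on Central-to-Northern dialect transfer, using synthetic data with limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. ViDia2Std is the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation, spanning 63 provinces. Unlike earlier datasets, ViDia2Std includes diverse dialects from the Central, Southern, and non-standard Northern regions that are often missing from existing resources. The dataset consists of more than 13,000 sentence pairs drawn from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Under this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std shows that dialect normalization substantially improves downstream tasks, underscoring the need for dialect-aware resources in building robust Vietnamese NLP systems.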
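The abstract mentions a semantic mapping agreement metric that treats synonymous standard mappings as agreement, but it does not give the formula. The following is only a minimal sketch of one possible reading, not the authors' definition: two annotators agree on an item when their standard-form mappings are identical after whitespace and case normalization, or when both fall in the same synonym group. The function names, the synonym groups, and the sample annotations are all hypothetical and invented for illustration.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def build_synonym_index(groups):
    """Map each normalized phrase to the id of its synonym group.

    If a phrase appears in several groups, the last group wins.
    """
    index = {}
    for group_id, group in enumerate(groups):
        for phrase in group:
            index[normalize(phrase)] = group_id
    return index


def mappings_agree(a, b, index):
    """True if two standard-form mappings are identical or listed as synonyms."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    group_a, group_b = index.get(a), index.get(b)
    return group_a is not None and group_a == group_b


def agreement_rate(ann_a, ann_b, groups):
    """Fraction of items on which two annotators agree (assumed definition)."""
    if len(ann_a) != len(ann_b):
        raise ValueError("annotators must label the same items")
    if not ann_a:
        return 0.0
    index = build_synonym_index(groups)
    hits = sum(mappings_agree(a, b, index) for a, b in zip(ann_a, ann_b))
    return hits / len(ann_a)


# Hypothetical data: standard-form mappings chosen by two annotators
# for the same four dialect items.
annotator_a = ["đâu", "tại sao", "gì", "bố"]
annotator_b = ["đâu", "vì sao", "cái gì", "mẹ"]

# Hypothetical synonym groups among standard forms.
synonym_groups = [{"tại sao", "vì sao"}, {"gì", "cái gì"}]

rate = agreement_rate(annotator_a, annotator_b, synonym_groups)
print(f"agreement rate: {rate:.2f}")
```

In this toy example three of the four items count as agreement (one exact match and two synonym matches), whereas strict string matching would count only one. A real implementation would need the paper's actual criterion for what qualifies as a synonymous mapping.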


Related papers