Fugu-MT paper translation (abstract): Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

Paper summary: Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

arxiv url: http://arxiv.org/abs/2606.24169v1
Date: Tue, 23 Jun 2026 05:51:35 GMT
Status: translation complete
Last updated in system: 2026-06-24 22:16:48.795561
Title: Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR
Authors: Nenad Banfic
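Abstract summary: Adapting a streaming speech recognition model to a new language requires choosing between two plausible warm starts. The common intuition is that a multilingual encoder helps most at low data. It is unclear how long that advantage persists, whether tight streaming latency amplifies it, and whether it survives deployment quantization.
Reference score (independently computed attention measure): 0.0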
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adapting a streaming speech recognition model to a new language requires choosing between two plausible warm starts: a multilingual (ML) encoder or an English-only (EN) encoder. The common intuition is that the multilingual encoder should help most at low data, but it is unclear how long that advantage persists, whether tight streaming latency amplifies it, and whether it survives deployment quantization. We answer these questions with a controlled sweep of a 0.6B-parameter cache-aware FastConformer transducer across eight European languages, up to five target-language data scales (100 h to 2500 h), three streaming tiers plus offline decoding, and up to four public test sets. The main result is that multilingual initialization is a data-limited advantage, not a latency-limited one. On FLEURS at 160 ms, the mean EN-ML word error rate (WER) gap falls from +4.21 percentage points (pp) at 100 h to +0.20 pp at 2500 h; a power-law fit summarizes this decay, with each doubling of target-language data roughly halving the remaining advantage. Across the three streaming tiers, the across-language mean EN-ML gap is approximately stable at each scale from 100 to 1000 h, and is near zero by 2500 h. Finally, 4-bit weight-only encoder quantization at the matched 560 ms streaming tier reduces the encoder footprint by about 3x, with an average FLEURS WER increase of about 0.5 pp. The resulting guideline is simple: use multilingual initialization in low-data regimes, treat the choice as effectively irrelevant at large data, and make latency and quantization decisions independently.
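The "roughly halving per doubling" statement can be sanity-checked from the two endpoint values quoted in the abstract alone. The sketch below is not the paper's fit (which presumably uses all data scales and per-language results); it assumes a pure power law gap(h) = a * h^(-b) forced through the two reported points (100 h, 4.21 pp) and (2500 h, 0.20 pp). The function and variable names are illustrative only.

```python
import math

# Endpoint values quoted in the abstract (FLEURS, 160 ms tier).
H_LOW, GAP_LOW = 100.0, 4.21     # hours of target data, mean EN-ML WER gap in pp
H_HIGH, GAP_HIGH = 2500.0, 0.20


def two_point_power_law(h1, g1, h2, g2):
    """Return (a, b) so that gap(h) = a * h**(-b) passes through both points.

    Assumption: a pure power law with no offset term. This is a
    back-of-the-envelope check, not the fit reported in the paper.
    """
    b = math.log(g1 / g2) / math.log(h2 / h1)
    a = g1 * h1 ** b
    return a, b


a, b = two_point_power_law(H_LOW, GAP_LOW, H_HIGH, GAP_HIGH)


def gap(hours):
    """Gap in pp implied by the two-point curve (not a measured value)."""
    return a * hours ** (-b)


# Factor by which the gap shrinks each time the data doubles.
per_doubling = 2.0 ** (-b)

print(f"exponent b              = {b:.3f}")
print(f"gap factor per doubling = {per_doubling:.3f}")
```

Under this two-point assumption the exponent comes out near 0.95 and the per-doubling factor near 0.52, which is consistent with the abstract's "roughly halving" description. Values of `gap(h)` at intermediate scales are implied by this curve only and should not be read as reported results.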
