Fugu-MT Paper Translation (Abstract): Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

Paper overview: Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

arxiv url: http://arxiv.org/abs/2605.28642v1
Date: Wed, 27 May 2026 15:47:33 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:56.187298
Title: Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation
Authors: Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Ming Liu, Bing Qin, Yang Xiang
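Abstract summary: Edge-cloud Speech Recognition and Translation (ESRT) is a privacy-preserving, bandwidth-efficient collaborative edge-cloud MLLM framework. It uses an edge-cloud split inference architecture that keeps a lightweight speech encoder and adapter on the device and transmits only highly compressed intermediate features to the cloud. To overcome English-centric bottlenecks, it introduces a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency.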
Reference score (independently computed attention score): 38.38807634557459
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English-centric biases, restricting many-to-many translation scaling. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device, transmitting only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10$\times$. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages ($45 \times 44$ directions). Code and models are released at https://github.com/yxduir/esrt to facilitate reproducible, privacy-aware MLLM S2TT research.
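Note on the direction count: the "$45 \times 44$ directions" figure in the abstract follows from treating each translation direction as an ordered (source, target) pair of two distinct languages. The short Python sketch below only illustrates that arithmetic. The language codes are placeholders (the abstract gives the number of languages, not the list), and the choice of a single pivot language standing in for English is an assumption made for illustration, not something taken from the paper.

```python
from itertools import permutations

# Placeholder codes: the abstract states the count (45), not the language list.
NUM_LANGUAGES = 45
languages = [f"lang{i:02d}" for i in range(NUM_LANGUAGES)]

# A translation direction is an ordered (source, target) pair of distinct
# languages, so A->B and B->A are counted separately.
directions = list(permutations(languages, 2))

# Hypothetical pivot standing in for English: directions that neither start
# nor end in the pivot are the ones an English-centric system does not cover.
pivot = languages[0]
non_pivot = [pair for pair in directions if pivot not in pair]

print(len(directions))  # 45 * 44 = 1980
print(len(non_pivot))   # 44 * 43 = 1892
```

Under these assumptions, 1892 of the 1980 directions involve no pivot language at all, which gives a sense of why the abstract treats English-centric bias as a limit on many-to-many scaling.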
