Fugu-MT Paper Translation (Abstract): Entropy-based Coarse and Compressed Semantic Speech Representation Learning

Paper overview: Entropy-based Coarse and Compressed Semantic Speech Representation Learning

arxiv url: http://arxiv.org/abs/2509.00503v1
Date: Sat, 30 Aug 2025 13:50:58 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.260593
Title: Entropy-based Coarse and Compressed Semantic Speech Representation Learning
Authors: Jialong Zuo, Guangyan Zhang, Minghui Fang, Shengpeng Ji, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Zhou Zhao
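Abstract summary: The paper proposes an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. Experiments on ASR, speech-to-text translation, and voice conversion tasks show that the compressed representations perform on par with or better than dense token sequences.
Reference score (independently computed attention metric): 72.18542411704347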
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Discrete speech representation learning has recently attracted increasing interest in both acoustic and semantic modeling. Existing approaches typically encode 16 kHz waveforms into discrete tokens at a rate of 25 or 50 tokens per second. However, given that speech generally conveys only 2 to 5 words per second, such fine-grained tokenization introduces redundancy and hinders efficiency in downstream training and inference. Moreover, semantic speech representations at this frequency primarily capture phonetic-level information, while semantic understanding may not require such detailed token-level resolution. To address these limitations, we propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. A speech language model is first pre-trained via next-token prediction on large-scale unlabeled data to capture frequent token patterns. Predictive entropy is then used to adaptively determine aggregation boundaries, followed by a cross-attention module that fuses information within each segment. By adjusting the entropy threshold, the granularity and compression ratio of the representations can be flexibly controlled. Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences, demonstrating the effectiveness of the proposed approach.
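The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a boundary is placed wherever the predictive entropy of the next-token distribution exceeds the threshold (the abstract does not state the exact rule), and it swaps in plain mean pooling for the cross-attention fusion module. The names `entropy`, `segment_by_entropy`, and `mean_pool`, along with the toy distributions, are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def segment_by_entropy(next_token_probs, threshold):
    """Group token positions into contiguous segments.

    next_token_probs[i] is the distribution a pre-trained speech LM is
    assumed to assign when predicting token i. ASSUMPTION: a new segment
    starts at position i when that entropy exceeds `threshold`.
    """
    segments, current = [], []
    for i, probs in enumerate(next_token_probs):
        if current and entropy(probs) > threshold:
            segments.append(current)
            current = []
        current.append(i)
    if current:
        segments.append(current)
    return segments


def mean_pool(features, segments):
    """Fuse each segment into one vector by averaging.

    ASSUMPTION: stand-in for the paper's cross-attention fusion module.
    """
    dim = len(features[0])
    return [
        [sum(features[i][d] for i in seg) / len(seg) for d in range(dim)]
        for seg in segments
    ]


# Toy data: a 4-symbol vocabulary, 6 token positions.
uncertain = [0.25, 0.25, 0.25, 0.25]   # high entropy (ln 4)
confident = [0.97, 0.01, 0.01, 0.01]   # low entropy
dists = [uncertain, confident, confident, uncertain, confident, confident]
features = [[float(i)] for i in range(6)]  # one 1-D feature per token

segments = segment_by_entropy(dists, threshold=1.0)
pooled = mean_pool(features, segments)
compression_ratio = len(dists) / len(segments)
```

In this sketch, lowering the threshold produces more boundaries and therefore finer segments, while raising it merges more tokens into each segment. That mirrors the abstract's statement that the threshold controls granularity and compression ratio.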
