Fugu-MT Paper Translation (Abstract): WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

Paper summary: WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

arxiv url: http://arxiv.org/abs/2605.06407v1
Date: Thu, 07 May 2026 15:17:24 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.935974
Title: WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
Title (reference translation): WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
Authors: Guanrou Yang, Tian Tan, Qian Chen, Zhikang Niu, Yakun Song, Ziyang Ma, Yushen Chen, Zeyu Xie, Tianrui Wang, Yifan Yang, Wenxi Chen, Qi Chen, Wenrui Liu, Shan Yang, Xie Chen
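Abstract summary: WavCube is a compact continuous latent derived from an SSL speech encoder. It simultaneously supports speech understanding, reconstruction, and generation. Experiments show state-of-the-art zero-shot TTS performance and markedly faster training convergence.
Reference score (independently computed attention score): 35.33131758542107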
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.
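Abstract (reference translation): Integrating speech understanding and generation is an important step toward building unified speech models. However, the different representations these two tasks require currently pose major compatibility challenges. Typically, semantics-oriented features are learned through self-supervised learning (SSL), while acoustics-oriented features are learned through reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. This paper presents WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube adopts a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter out the off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic detail through end-to-end reconstruction, while a semantic anchoring loss keeps the representation grounded within its original semantic manifold. Comprehensive experiments show that WavCube comes close to WavLM performance on SUPERB despite 8x dimensional compression, achieves reconstruction quality comparable to existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and performs strongly on speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations show that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Code and checkpoints are available at https://github.com/yanghaha0908/WavCube.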
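
The abstract describes two ideas that a small numeric sketch can make concrete: an 8x dimensional bottleneck over SSL features, and a Stage 2 objective that combines reconstruction with a semantic anchoring term. The snippet below is only an illustration under stated assumptions and is not the authors' implementation: the 1024-dimensional feature width, the linear projection, the use of mean squared error for both terms, and the names `bottleneck`, `stage2_loss`, and `anchor_weight` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

SSL_DIM = 1024      # assumed SSL feature width (not stated in the abstract)
LATENT_DIM = 128    # 8x smaller, mirroring the "8x dimensional compression" claim


def bottleneck(features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Project frame-level features (T, SSL_DIM) to a compact latent (T, LATENT_DIM)."""
    return features @ weight


def stage2_loss(target, reconstruction, latent, anchor_latent, anchor_weight=1.0):
    """Reconstruction MSE plus a weighted MSE pulling the latent toward a reference latent.

    Both terms are plain MSE here purely for illustration.
    """
    recon_term = float(np.mean((target - reconstruction) ** 2))
    anchor_term = float(np.mean((latent - anchor_latent) ** 2))
    return recon_term + anchor_weight * anchor_term


rng = np.random.default_rng(0)
features = rng.standard_normal((50, SSL_DIM))   # 50 frames of stand-in features
weight = rng.standard_normal((SSL_DIM, LATENT_DIM)) / np.sqrt(SSL_DIM)

stage1_latent = bottleneck(features, weight)    # reference latent from the "Stage 1" step
drifted_latent = stage1_latent + 0.1            # pretend Stage 2 shifted the latent
signal = rng.standard_normal(16000)             # stand-in waveform

# Perfect reconstruction, so only the anchoring term contributes to the loss.
loss = stage2_loss(signal, signal, drifted_latent, stage1_latent, anchor_weight=2.0)
print(stage1_latent.shape, SSL_DIM // LATENT_DIM, loss)
```

In this toy setup the anchoring term grows as the latent drifts away from its Stage 1 value, which is the intuition behind keeping the representation on its original semantic manifold while acoustic detail is added.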

