Fugu-MT Paper Translation (Abstract): Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

Paper abstract: Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

arxiv url: http://arxiv.org/abs/2512.06689v1
Date: Sun, 07 Dec 2025 06:48:54 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.469877
Title: Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation
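Title (reference translation): Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation
Authors: Jisoo Park, Seonghak Lee, Guisik Kim, Taewoo Kim, Junseok Kwon
Abstract summary: Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. The paper proposes UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization.
Reference score (independently computed attention score): 26.48174619097384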
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
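Abstract (reference translation): Speech enhancement (SE) and speech separation (SS) have traditionally been treated as distinct tasks in speech processing. Real-world audio, however, often contains both background noise and overlapping speakers, which motivates a unified solution. Recent approaches have tried to integrate SE and SS within multi-stage architectures, but they typically involve complex, parameter-heavy models and rely on supervised training, which limits scalability and generalization. This work proposes UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite uses lip motion and facial identity cues to guide speech extraction, and it employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results show that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.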
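The abstract does not say how the Wasserstein distance regularization is computed. As a rough illustration only, the sketch below assumes the latent code is modeled as a diagonal Gaussian and is pulled toward a standard normal prior, for which the squared 2-Wasserstein distance has a closed form. The function names, the choice of prior, and the `weight` parameter are hypothetical and are not taken from the paper or its repository.

```python
def w2_squared_to_standard_normal(mu, sigma):
    """Squared 2-Wasserstein distance between N(mu, diag(sigma^2)) and N(0, I).

    For diagonal Gaussians the closed form is
    ||mu_1 - mu_2||^2 + ||sigma_1 - sigma_2||^2; here mu_2 = 0 and sigma_2 = 1.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    mean_term = sum(m * m for m in mu)
    std_term = sum((s - 1.0) ** 2 for s in sigma)
    return mean_term + std_term


def regularized_loss(reconstruction_loss, mu, sigma, weight=0.1):
    """Hypothetical training objective: reconstruction term plus a weighted
    Wasserstein penalty on the latent distribution (an assumption, not the
    paper's stated loss)."""
    return reconstruction_loss + weight * w2_squared_to_standard_normal(mu, sigma)
```

Because the penalty depends only on the latent statistics, a regularizer of this kind can be computed without a clean reference signal, which is consistent with the abstract's claim that paired noisy-clean data is not required.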

Related papers