Fugu-MT Paper Translation (Summary): Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba

Paper summary: Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba

arxiv url: http://arxiv.org/abs/2510.04738v1
Date: Mon, 06 Oct 2025 12:11:31 GMT
Status: Translation complete
System update date: 2025-10-07 16:52:59.842096
Title: Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba
Title (reference translation): Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba
Authors: Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis
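Abstract summary: MAVE is a novel autoregressive architecture for text-conditioned voice editing and high-fidelity speech synthesis. MAVE achieves state-of-the-art performance in speech editing and highly competitive results in zero-shot TTS. MAVE requires about 6x less memory than VoiceCraft during inference on utterances from the RealEdit database.
Reference score (independently computed attention score): 5.055749974859193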
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in speech editing and very competitive results in zero-shot TTS, while not being explicitly trained on the latter task, outperforming leading autoregressive and diffusion models on diverse, real-world audio. By integrating Mamba for efficient audio sequence modeling with cross-attention for precise text-acoustic alignment, MAVE enables context-aware voice editing with exceptional naturalness and speaker consistency. In pairwise human evaluations on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE-edited speech as perceptually equal to the original, while 24.8% preferred the original and 18.0% preferred MAVE, demonstrating that in the majority of cases edits are indistinguishable from the source. MAVE compares favorably with VoiceCraft and FluentSpeech both on pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness, without requiring multiple inference runs or post-processing. Remarkably, these quality gains come with a significantly lower memory cost and approximately the same latency: MAVE requires ~6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch size 1). Our results demonstrate that MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the synergistic integration of structured state-space modeling and cross-modal attention.
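Abstract (reference translation): We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech synthesis. MAVE achieves state-of-the-art performance in speech editing and highly competitive results in zero-shot TTS, although it is not explicitly trained on the latter task. By integrating Mamba for efficient audio sequence modeling with cross-attention for precise text-acoustic alignment, MAVE enables context-aware voice editing. In pairwise human evaluations on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE-edited speech as perceptually equal to the original, while 24.8% preferred the original and 18.0% preferred MAVE. MAVE compares favorably with VoiceCraft and FluentSpeech in both pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness without requiring multiple inference runs or post-processing. MAVE requires about 6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21 s, A100, FP16, batch size 1). These results show that MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the synergistic integration of structured state-space modeling and cross-modal attention.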
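
The pairwise-evaluation figures in the abstract can be unpacked with a little arithmetic. The sketch below is only an illustration, not part of the paper: it assumes the three reported percentages refer to the 400 judgments, and the dictionary keys and helper names are hypothetical. Because the percentages are rounded to one decimal place, the implied counts need not be whole numbers.

```python
from decimal import Decimal

# Figures quoted in the abstract; the labels are illustrative names only.
TOTAL_JUDGMENTS = 400
REPORTED_PERCENT = {
    "rated_equal": Decimal("57.2"),
    "preferred_original": Decimal("24.8"),
    "preferred_mave": Decimal("18.0"),
}


def implied_counts(percentages, total):
    """Convert reported percentages into the judgment counts they imply."""
    return {label: pct * total / 100 for label, pct in percentages.items()}


def share_among_decided(percentages, label):
    """Share of `label` among judgments that expressed a preference (ties excluded)."""
    decided = percentages["preferred_original"] + percentages["preferred_mave"]
    return percentages[label] / decided


counts = implied_counts(REPORTED_PERCENT, TOTAL_JUDGMENTS)
# Judgments in which the edit was rated at least as good as the original.
not_worse = REPORTED_PERCENT["rated_equal"] + REPORTED_PERCENT["preferred_mave"]
mave_share = share_among_decided(REPORTED_PERCENT, "preferred_mave")

if __name__ == "__main__":
    for label, count in counts.items():
        print(f"{label}: {count} of {TOTAL_JUDGMENTS} judgments")
    print(f"equal or MAVE preferred: {not_worse}%")
    print(f"MAVE share among decided judgments: {float(mave_share):.1%}")
```

Read this way, the edited speech was judged equal to or better than the original in roughly three quarters of the judgments, while the original still has the edge among judgments that expressed a preference.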

Related papers