Fugu-MT Paper Translation (Abstract): BareWave: Waveform-Native Flow-Matching Text-to-Speech

Paper abstract: BareWave: Waveform-Native Flow-Matching Text-to-Speech

arxiv url: http://arxiv.org/abs/2606.09048v1
Date: Mon, 08 Jun 2026 05:36:42 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.714649
Title: BareWave: Waveform-Native Flow-Matching Text-to-Speech
Title (reference translation): BareWave: Waveform-Native Flow-Matching Text-to-Speech
Authors: Wei Fan, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li, Kejiang Chen, Weiming Zhang, Nenghai Yu
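Abstract summary: We propose BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path.
Reference score (independently computed attention score): 76.5390412686083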
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.
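Abstract (reference translation): Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still often built by way of an intermediate acoustic representation before waveform synthesis. In this work, we propose BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. Raw-waveform modeling lacks a strong pretrained representational scaffold, different training stages benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to combine with effective perceptual refinement. We therefore develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path that uses no pretrained components at test time. Zero-shot voice cloning experiments show that strong intelligibility, speaker similarity, and naturalness are achievable under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. A project page with audio demos is available at https://barewave.github.io/.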
