Fugu-MT paper translation (abstract): Large-scale unsupervised audio pre-training for video-to-speech synthesis

Paper summary: Large-scale unsupervised audio pre-training for video-to-speech synthesis

arxiv url: http://arxiv.org/abs/2306.15464v2
Date: Mon, 31 Jul 2023 12:09:18 GMT
Status: translation complete
Last updated in system: 2023-08-01 20:33:05.257843
Title: Large-scale unsupervised audio pre-training for video-to-speech synthesis
Title (reference translation): Large-scale unsupervised audio pre-training for video-to-speech synthesis
Authors: Triantafyllos Kefalas, Yannis Panagakis, Maja Pantic
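Abstract summary: Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. This paper proposes training encoder-decoder models on more than 3,500 hours of audio data at 24 kHz. The pre-trained decoders are then used to initialize the audio decoders for the video-to-speech synthesis task.
Reference score (independently computed attention score): 64.86087257004883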
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process, whereby an intermediate representation from the video, such as a spectrogram, is extracted first and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, whereby the generation of raw audio and any intermediate representations is performed jointly. All such approaches involve training on data from almost exclusively audio-visual datasets, i.e. every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets which may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets etc.), as well as audio-only architectures that have been developed by the audio machine learning community over the years. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this pre-training step improves the reconstructed speech and that it is an unexplored way to improve the quality of the generator in a cross-modal task while only requiring samples from one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models with existing work.
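Abstract (reference translation): Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches are two-step processes in which an intermediate representation of the video, such as a spectrogram, is first extracted and then passed to a vocoder to generate the raw audio. Recent work has focused on end-to-end synthesis, in which the raw audio and any intermediate representations are generated jointly. All of these approaches train almost exclusively on audio-visual datasets, that is, data in which every audio sample has a corresponding video sample. This prevents the use of abundant audio-only datasets that have no corresponding visual modality (for example audiobooks, radio podcasts, and speech recognition datasets), as well as of audio-only architectures developed over the years by the audio machine learning community. This paper proposes training encoder-decoder models on more than 3,500 hours of audio data at 24 kHz and using the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and requires neither labels nor corresponding samples from other modalities (visual, text). We demonstrate that this pre-training step improves the reconstructed speech, and that it is an unexplored way to improve the quality of the generator in a cross-modal task while requiring samples from only one of the modalities. We conduct experiments with both raw audio and mel spectrograms as target outputs and benchmark our models against existing work.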
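The decoder-initialization step described in the abstract can be illustrated with a toy sketch. This is not the authors' code: the parameter names (`audio_encoder.layer0`, `decoder.layer0`, `video_encoder.layer0`), the helper functions, and the `pretend_pretrain` stand-in are all hypothetical, and plain Python dictionaries stand in for real network weights. The sketch only shows the idea that decoder parameters learned from audio alone are copied into a video-to-speech model, while the video encoder keeps its own initialization.

```python
import random

# Hypothetical layer names shared by both models' audio decoders.
AUDIO_DECODER_LAYERS = ["decoder.layer0", "decoder.layer1"]


def init_params(names, size, seed):
    """Return randomly initialized toy parameters, one weight list per name."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-0.1, 0.1) for _ in range(size)] for name in names}


def make_audio_autoencoder(seed):
    """Toy audio-only encoder-decoder: audio encoder plus audio decoder."""
    return init_params(["audio_encoder.layer0"] + AUDIO_DECODER_LAYERS, 4, seed)


def make_video_to_speech_model(seed):
    """Toy video-to-speech model: video encoder plus the same decoder layout."""
    return init_params(["video_encoder.layer0"] + AUDIO_DECODER_LAYERS, 4, seed)


def pretend_pretrain(params):
    """Stand-in for unsupervised audio-only pre-training (assumption:
    real training would update weights by gradient descent)."""
    return {name: [w + 1.0 for w in weights] for name, weights in params.items()}


def init_decoder_from_pretrained(target, pretrained, prefix="decoder."):
    """Copy only the decoder parameters from `pretrained` into `target`.

    Returns the list of parameter names that were copied.
    """
    copied = []
    for name, weights in pretrained.items():
        if name.startswith(prefix) and name in target:
            target[name] = list(weights)  # copy, so the models stay independent
            copied.append(name)
    return copied


if __name__ == "__main__":
    audio_model = pretend_pretrain(make_audio_autoencoder(seed=0))
    video_model = make_video_to_speech_model(seed=1)
    copied = init_decoder_from_pretrained(video_model, audio_model)
    print("Initialized from audio pre-training:", sorted(copied))
```

In a deep learning framework the same idea would typically be expressed by loading only the decoder subset of a saved checkpoint into the new model; the audio encoder is discarded because the video-to-speech model has no audio input at inference time.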
