Fugu-MT Paper Translation (Abstract): DeepSinger: Singing Voice Synthesis with Data Mined From the Web

Paper overview: DeepSinger: Singing Voice Synthesis with Data Mined From the Web

arxiv url: http://arxiv.org/abs/2007.04590v2
Date: Wed, 15 Jul 2020 14:37:45 GMT
Status: Translation complete
System update date: 2022-11-12 05:00:06.747983
Title: DeepSinger: Singing Voice Synthesis with Data Mined From the Web
Title (reference translation): DeepSinger: Singing Voice Synthesis with Data Mined From the Web
Authors: Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu
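Abstract summary: DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites. It is evaluated on a mined singing dataset of about 92 hours of data from 89 singers in three languages.
Reference score (independently computed attention measure): 194.10598657846145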
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in lyrics, starting from the coarse-grained sentence level and moving to the fine-grained phoneme level, and further design a multi-lingual multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, and synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, removing the complicated acoustic feature modeling of parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results demonstrate that with singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness (footnote: audio samples are available at https://speechresearch.github.io/deepsinger/).
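Abstract (reference translation): This paper develops DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system built from scratch using singing training data mined from music websites. The DeepSinger pipeline consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, the authors design a lyrics-to-singing alignment model that automatically extracts the duration of each phoneme in the lyrics, and a multi-lingual multi-singer singing model based on a feed-forward Transformer that generates linear-spectrograms directly from lyrics, with voices synthesized using Griffin-Lim. DeepSinger has several advantages over previous SVS systems. 1) To the best of the authors' knowledge, it is the first SVS system that mines training data directly from music websites. 2) The lyrics-to-singing alignment model avoids human effort for alignment labeling and greatly reduces labeling cost. 3) The singing model based on a feed-forward Transformer is simple and efficient: it removes the complicated acoustic feature modeling of parametric synthesis and uses a reference encoder to capture a singer's timbre from noisy singing data. 4) It can synthesize singing voices in multiple languages and for multiple singers. DeepSinger is evaluated on the mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results show that, with singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness (footnote: audio samples are available at https://speechresearch.github.io/deepsinger/).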
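
The abstract says the alignment model extracts a duration for each phoneme and that a feed-forward Transformer then generates spectrogram frames from the lyrics. The sketch below illustrates one way phoneme-level durations could be turned into frame-level inputs. It is not the authors' implementation: the `AlignedPhoneme` type, the function names, and the 12.5 ms hop size are all hypothetical choices made for illustration, and the abstract does not state how durations are consumed by the model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedPhoneme:
    """Hypothetical record for one phoneme produced by an alignment step."""
    symbol: str       # phoneme label
    start_sec: float  # onset time in seconds
    end_sec: float    # offset time in seconds


def durations_in_frames(phonemes: List[AlignedPhoneme],
                        hop_sec: float = 0.0125) -> List[int]:
    """Convert phoneme boundaries to integer frame counts.

    Boundaries (not durations) are rounded to the frame grid, so the
    frame counts of adjacent phonemes sum to the rounded total length.
    The hop size is an assumed value, not taken from the paper.
    """
    counts = []
    for p in phonemes:
        start_frame = round(p.start_sec / hop_sec)
        end_frame = round(p.end_sec / hop_sec)
        counts.append(max(end_frame - start_frame, 0))
    return counts


def expand_to_frames(phonemes: List[AlignedPhoneme],
                     hop_sec: float = 0.0125) -> List[str]:
    """Repeat each phoneme symbol once per frame it covers."""
    frames: List[str] = []
    for p, n in zip(phonemes, durations_in_frames(phonemes, hop_sec)):
        frames.extend([p.symbol] * n)
    return frames


if __name__ == "__main__":
    example = [
        AlignedPhoneme("n", 0.00, 0.05),
        AlignedPhoneme("i", 0.05, 0.30),
    ]
    print(durations_in_frames(example))
    print(len(expand_to_frames(example)))
```

Rounding the boundaries rather than each duration separately is a design choice that keeps rounding errors from accumulating over a long sentence.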

Related papers