VocBench: A Neural Vocoder Benchmark for Speech Synthesis
- URL: http://arxiv.org/abs/2112.03099v1
- Date: Mon, 6 Dec 2021 15:09:57 GMT
- Title: VocBench: A Neural Vocoder Benchmark for Speech Synthesis
- Authors: Ehab A. AlBadawy, Andrew Gibiansky, Qing He, Jilong Wu, Ming-Ching
Chang, Siwei Lyu
- Abstract summary: We present VocBench, a framework that benchmarks the performance of state-of-the-art neural vocoders.
VocBench uses a systematic study to evaluate different neural vocoders in a shared environment that enables a fair comparison between them.
Our results demonstrate that the framework is capable of showing the competitive efficacy and the quality of the synthesized samples for each vocoder.
- Score: 36.94062576597112
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Neural vocoders, used for converting the spectral representations of an audio
signal to the waveforms, are a commonly used component in speech synthesis
pipelines. It focuses on synthesizing waveforms from low-dimensional
representation, such as Mel-Spectrograms. In recent years, different approaches
have been introduced to develop such vocoders. However, it becomes more
challenging to assess these new vocoders and compare their performance to
previous ones. To address this problem, we present VocBench, a framework that
benchmarks the performance of state-of-the-art neural vocoders. VocBench uses a
systematic study to evaluate different neural vocoders in a shared environment
that enables a fair comparison between them. In our experiments, we use the
same setup for datasets, training pipeline, and evaluation metrics for all
neural vocoders. We perform subjective and objective evaluations to compare
the performance of each vocoder along different axes. Our results demonstrate
that the framework is capable of showing the competitive efficacy and the
quality of the synthesized samples for each vocoder. The VocBench framework is
available at https://github.com/facebookresearch/vocoder-benchmark.
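To make the vocoder's input concrete, the sketch below computes a log-mel-spectrogram from a raw waveform using only numpy. The sample rate, FFT size, hop length, and 80-band setting are common illustrative defaults, not the exact VocBench configuration; a neural vocoder would take this `(n_mels, n_frames)` matrix as input and predict the waveform.

```python
import numpy as np

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Band edges equally spaced on the mel scale, mapped back to FFT bins.
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Log-mel spectrogram of shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    frames = [wav[s:s + n_fft] * window
              for s in range(0, len(wav) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=-1)) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T
    return np.log(np.maximum(mel, 1e-10))   # floor avoids log(0)
```

Real benchmarks typically use a library implementation (e.g. librosa or torchaudio) with matched parameters across all vocoders, which is exactly the shared-setup point the abstract makes.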
Related papers
- How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection [60.88800374832363]
Recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.
arXiv Detail & Related papers (2026-02-18T10:29:07Z) - TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument [19.395289629201056]
TokenSynth is a novel neural synthesizer that generates audio tokens from MIDI tokens and a CLAP embedding.
Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation.
arXiv Detail & Related papers (2025-02-13T03:40:30Z) - A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on the Schrödinger Bridge.
We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z) - Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
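The "multiscale spectrogram adversary" compares real and generated audio at several STFT resolutions, so that no single window size dominates. As a rough illustration of the multi-resolution idea only (a plain L1 spectral distance, not the paper's learned adversarial discriminator), one could write:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram via a Hann-windowed numpy STFT."""
    win = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * win
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_spectral_loss(ref, est, scales=(256, 512, 1024, 2048)):
    """Average L1 distance between magnitude spectrograms at several FFT sizes.

    Short windows catch transients; long windows catch tonal detail.
    """
    total = 0.0
    for n_fft in scales:
        a = stft_mag(ref, n_fft, hop=n_fft // 4)
        b = stft_mag(est, n_fft, hop=n_fft // 4)
        total += np.mean(np.abs(a - b))
    return total / len(scales)
```

In the adversarial setting, each resolution feeds a small discriminator network instead of a fixed L1 term, but the multi-scale decomposition is the same.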
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z) - DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the input speech that emulates those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z) - Universal Neural Vocoding with Parallel WaveNet [8.6698425961311]
We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder.
Our universal vocoder offers real-time high-quality speech synthesis on a wide range of use cases.
arXiv Detail & Related papers (2021-02-01T19:03:27Z) - RawNet: Fast End-to-End Neural Vocoder [4.507860128918788]
RawNet is a complete end-to-end neural vocoder based on the auto-encoder structure for speaker-dependent and -independent speech synthesis.
It automatically learns to extract features and recover audio using neural networks, which include a coder network to capture a higher representation of the input audio and an autoregressive voder network to restore the audio in a sample-by-sample manner.
arXiv Detail & Related papers (2019-04-10T10:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.