FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit
for Neural Speech Codec
- URL: http://arxiv.org/abs/2309.07405v1
- Date: Thu, 14 Sep 2023 03:18:24 GMT
- Title: FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit
for Neural Speech Codec
- Authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
- Abstract summary: This paper presents FunCodec, a fundamental neural speech toolkit, which is an extension of the open-source speech processing toolkit FunASR.
FunCodec provides reproducible training recipes and inference scripts for the latest neural speech models, such as SoundStream and Encodec.
Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes.
- Score: 55.95078490630001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit,
which is an extension of the open-source speech processing toolkit FunASR.
FunCodec provides reproducible training recipes and inference scripts for the
latest neural speech codec models, such as SoundStream and Encodec. Thanks to
the unified design with FunASR, FunCodec can be easily integrated into
downstream tasks, such as speech recognition. Along with FunCodec, pre-trained
models are also provided, which can be used for academic or generalized
purposes. Based on the toolkit, we further propose the frequency-domain codec
models, FreqCodec, which can achieve comparable speech quality with much lower
computation and parameter complexity. Experimental results show that, under the
same compression ratio, FunCodec can achieve better reconstruction quality
compared with other toolkits and released models. We also demonstrate that the
pre-trained models are suitable for downstream tasks, including automatic
speech recognition and personalized text-to-speech synthesis. This toolkit is
publicly available at https://github.com/alibaba-damo-academy/FunCodec.
Related papers
- LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec [14.7377193484733]
We propose LSCodec, a discrete speech that has both low and speaker decoupling ability.
By reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and smaller vocabulary size than baselines.
arXiv Detail & Related papers (2024-10-21T08:23:31Z) - Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference [10.909997817643905]
We present the Low Frame-rate Speech Codec (LFSC): a neural audio that leverages a finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps and 21.5 frames per second.
We demonstrate that our novel LLM can make the inference of text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
arXiv Detail & Related papers (2024-09-18T16:39:10Z) - Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model [36.61105228468503]
X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization stage.
X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications.
Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation.
arXiv Detail & Related papers (2024-08-30T10:24:07Z) - CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z) - Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z) - Enhancing into the codec: Noise Robust Speech Coding with
Vector-Quantized Autoencoders [21.74276379834421]
We develop compressor-enhancer encoders and accompanying decoders based on VQ-VAE autoencoders with WaveRNN decoders.
We observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.
arXiv Detail & Related papers (2021-02-12T16:42:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.