FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit
for Neural Speech Codec
- URL: http://arxiv.org/abs/2309.07405v1
- Date: Thu, 14 Sep 2023 03:18:24 GMT
- Authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit,
which is an extension of the open-source speech processing toolkit FunASR.
FunCodec provides reproducible training recipes and inference scripts for the
latest neural speech codec models, such as SoundStream and Encodec. Thanks to
the unified design with FunASR, FunCodec can be easily integrated into
downstream tasks, such as speech recognition. Along with FunCodec, pre-trained
models are also provided, which can be used for academic research or general
purposes. Based on the toolkit, we further propose frequency-domain codec
models, FreqCodec, which achieve comparable speech quality with much lower
computation and parameter complexity. Experimental results show that, under the
same compression ratio, FunCodec can achieve better reconstruction quality
compared with other toolkits and released models. We also demonstrate that the
pre-trained models are suitable for downstream tasks, including automatic
speech recognition and personalized text-to-speech synthesis. This toolkit is
publicly available at https://github.com/alibaba-damo-academy/FunCodec.
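The abstract compares codecs "under the same compression ratio". As a rough illustration of how that ratio follows from a codec's frame rate, number of quantizers, and codebook size (the specific numbers below are illustrative assumptions, not figures from the paper):

```python
import math

# Illustrative, assumed parameters (not taken from the FunCodec paper):
sample_rate = 16_000     # Hz, raw mono PCM
bit_depth = 16           # bits per sample
frame_rate = 50          # codec frames per second
num_quantizers = 8       # residual VQ stages per frame
codebook_size = 1024     # entries per codebook

# Each quantizer index costs log2(codebook_size) bits to transmit.
bits_per_index = math.log2(codebook_size)                  # 10.0 bits
codec_bps = frame_rate * num_quantizers * bits_per_index   # 4000 bits/s
raw_bps = sample_rate * bit_depth                          # 256000 bits/s

compression_ratio = raw_bps / codec_bps
print(f"codec bitrate: {codec_bps / 1000:.1f} kbps")       # 4.0 kbps
print(f"compression ratio: {compression_ratio:.0f}x")      # 64x
```

Dropping quantizer stages (fewer indices per frame) lowers the bitrate at the cost of reconstruction quality, which is the trade-off the paper's experiments sweep.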
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
SemantiCodec is designed to compress audio into fewer than a hundred tokens per second across diverse audio types.
We show that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality.
Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs.
arXiv Detail & Related papers (2024-04-30T22:51:36Z)
- PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders
We propose PromptCodec, a novel end-to-end neural speech codec using feature-aware prompt encoders.
Our proposed PromptCodec consistently outperforms state-of-the-art neural speech codec models under all different conditions.
arXiv Detail & Related papers (2024-04-03T13:00:08Z)
- RepCodec: A Speech Representation Codec for Speech Tokenization
RepCodec is a novel speech representation codec for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- SoundStream: An End-to-End Neural Audio Codec
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
- Enhancing into the codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders
We develop compressor-enhancer encoders and accompanying decoders based on VQ-VAE autoencoders with WaveRNN decoders.
We observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.
arXiv Detail & Related papers (2021-02-12T16:42:19Z)
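Several of the codecs listed above (SoundStream, Encodec, and the recipes FunCodec reproduces) share the residual vector quantization scheme described in the SoundStream entry: each stage quantizes the residual left by the previous stages, and decoding sums the selected entries. A minimal pure-Python sketch with toy codebooks (all names and values here are illustrative, not taken from any of the actual implementations):

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(codebooks, vec):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruction is the sum of the selected entries across stages."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for cb, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Toy 2-stage quantizer over 2-D vectors (illustrative codebooks).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],       # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],    # fine stage, corrects the residual
]
codes = rvq_encode(codebooks, [1.2, 0.8])   # -> [1, 1]
recon = rvq_decode(codebooks, codes)        # -> [1.25, 0.75]
```

In a real codec the codebooks are learned jointly with the encoder/decoder networks, and dropping later stages at inference time trades reconstruction quality for bitrate.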