FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit
for Neural Speech Codec
- URL: http://arxiv.org/abs/2309.07405v1
- Date: Thu, 14 Sep 2023 03:18:24 GMT
- Authors: Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit,
which is an extension of the open-source speech processing toolkit FunASR.
FunCodec provides reproducible training recipes and inference scripts for the
latest neural speech codec models, such as SoundStream and Encodec. Thanks to
the unified design with FunASR, FunCodec can be easily integrated into
downstream tasks, such as speech recognition. Along with FunCodec, pre-trained
models are also provided, which can be used for academic research or general
purposes. Based on the toolkit, we further propose frequency-domain codec
models, FreqCodec, which achieve comparable speech quality with much lower
computation and parameter complexity. Experimental results show that, under the
same compression ratio, FunCodec can achieve better reconstruction quality
compared with other toolkits and released models. We also demonstrate that the
pre-trained models are suitable for downstream tasks, including automatic
speech recognition and personalized text-to-speech synthesis. This toolkit is
publicly available at https://github.com/alibaba-damo-academy/FunCodec.
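The abstract compares codecs "under the same compression ratio". As a rough illustration of how that ratio follows from a codec's frame rate, number of quantizers, and codebook size (the specific numbers below are illustrative assumptions, not figures from the paper):

```python
import math

# Illustrative, assumed parameters (not taken from the FunCodec paper):
sample_rate = 16_000     # Hz, raw mono PCM
bit_depth = 16           # bits per sample
frame_rate = 50          # codec frames per second
num_quantizers = 8       # residual VQ stages per frame
codebook_size = 1024     # entries per codebook

# Each quantizer index costs log2(codebook_size) bits to transmit.
bits_per_index = math.log2(codebook_size)                  # 10.0 bits
codec_bps = frame_rate * num_quantizers * bits_per_index   # 4000 bits/s
raw_bps = sample_rate * bit_depth                          # 256000 bits/s

compression_ratio = raw_bps / codec_bps
print(f"codec bitrate: {codec_bps / 1000:.1f} kbps")       # 4.0 kbps
print(f"compression ratio: {compression_ratio:.0f}x")      # 64x
```

Dropping quantizer stages (fewer indices per frame) lowers the bitrate at the cost of reconstruction quality, which is the trade-off the paper's experiments sweep.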
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
SemantiCodec is designed to compress audio into fewer than a hundred tokens per second across diverse audio types.
We show that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality.
Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs.
arXiv Detail & Related papers (2024-04-30T22:51:36Z)
- PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders
We propose PromptCodec, a novel end-to-end neural speech codec using feature-aware prompt encoders.
Our proposed PromptCodec consistently outperforms state-of-the-art neural speech codec models under all different conditions.
arXiv Detail & Related papers (2024-04-03T13:00:08Z)
- RepCodec: A Speech Representation Codec for Speech Tokenization
RepCodec is a novel speech representation codec for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- SoundStream: An End-to-End Neural Audio Codec
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
- Enhancing into the codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders
We develop compressor-enhancer encoders and accompanying decoders based on VQ-VAE autoencoders with WaveRNN decoders.
We observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.
arXiv Detail & Related papers (2021-02-12T16:42:19Z)
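Several of the codecs listed above (SoundStream, Encodec, and the recipes FunCodec reproduces) share the residual vector quantization scheme described in the SoundStream entry: each stage quantizes the residual left by the previous stages, and decoding sums the selected entries. A minimal pure-Python sketch with toy codebooks (all names and values here are illustrative, not taken from any of the actual implementations):

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(codebooks, vec):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruction is the sum of the selected entries across stages."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for cb, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Toy 2-stage quantizer over 2-D vectors (illustrative codebooks).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],       # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],    # fine stage, corrects the residual
]
codes = rvq_encode(codebooks, [1.2, 0.8])   # -> [1, 1]
recon = rvq_decode(codebooks, codes)        # -> [1.25, 0.75]
```

In a real codec the codebooks are learned jointly with the encoder/decoder networks, and dropping later stages at inference time trades reconstruction quality for bitrate.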