The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge
- URL: http://arxiv.org/abs/2410.23815v1
- Date: Thu, 31 Oct 2024 10:58:59 GMT
- Title: The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge
- Authors: Dake Guo, Jixun Yao, Xinfa Zhu, Kangxiang Xia, Zhao Guo, Ziyu Zhang, Yao Wang, Jie Liu, Lei Xie
- Abstract summary: This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC).
Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2.
- Score: 12.862628838633396
- Abstract: This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC). Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2. In Track 1, we employ Single-Codec to tokenize the speech into discrete tokens and use a language-model-based approach to achieve zero-shot speaking style cloning. Single-Codec effectively decouples timbre and speaking style at the token level, reducing the acoustic modeling burden on the autoregressive language model. Additionally, we use DSPGAN to upsample 16 kHz mel-spectrograms to high-fidelity 48 kHz waveforms. In Track 2, we propose a background audio generator based on large language models (LLMs). This system produces scene-appropriate accompaniment descriptions, synthesizes background audio with Tango 2, and integrates it with the speech generated by our Track 1 system. Our submission achieves second place in Track 1 and first place in Track 2.
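As a rough illustration of the pipeline described in the abstract, the sketch below wires the two tracks together. All component objects and method names (codec.encode, style_lm.generate, tango2.synthesize, the mixing ratio) are hypothetical stand-ins for the actual components, not the authors' API.

```python
import numpy as np


def generate_speech(text, prompt_wav, codec, style_lm, vocoder):
    """Track 1: zero-shot speaking-style cloning via discrete tokens."""
    # Single-Codec tokenizes the style prompt into discrete tokens,
    # decoupling timbre and speaking style at the token level.
    prompt_tokens = codec.encode(prompt_wav)
    # An autoregressive language model continues the token sequence,
    # conditioned on the target text and the style prompt.
    speech_tokens = style_lm.generate(text=text, prompt=prompt_tokens)
    # Decode tokens to a 16 kHz mel-spectrogram, then let DSPGAN
    # upsample it to a high-fidelity 48 kHz waveform.
    mel_16k = codec.decode(speech_tokens)
    return vocoder.upsample(mel_16k)


def add_background(text, speech_48k, llm, tango2, snr_db=15.0):
    """Track 2: LLM-described background audio mixed under the speech."""
    # The LLM turns the transcript into a scene-appropriate
    # accompaniment description, which Tango 2 renders as audio.
    description = llm.describe_scene(text)
    background = tango2.synthesize(description,
                                   seconds=len(speech_48k) / 48000)
    # Pad or trim the background to match the speech length.
    pad = max(0, len(speech_48k) - len(background))
    background = np.pad(background, (0, pad))[: len(speech_48k)]
    # Mix at a fixed speech-to-background ratio (snr_db is an assumed
    # knob, not a value reported in the paper).
    gain = 10 ** (-snr_db / 20) * np.std(speech_48k) / (np.std(background) + 1e-8)
    return speech_48k + gain * background
```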
Related papers
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose TransFace, a model for talking head translation that can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- The FruitShell French synthesis system at the Blizzard 2023 Challenge [12.459890525109646]
This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023.
The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals.
arXiv Detail & Related papers (2023-09-01T02:56:20Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
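A minimal sketch of this initialization step; the checkpoint layout and the "decoder." key prefix are assumptions for illustration, not the paper's actual module names.

```python
import torch


def init_decoder_from_pretraining(v2s_model: torch.nn.Module,
                                  checkpoint_path: str) -> None:
    """Copy pre-trained audio-decoder weights into a video-to-speech
    model; the video encoder is left to train from scratch."""
    state = torch.load(checkpoint_path, map_location="cpu")
    # Assumed layout: decoder parameters stored under a "decoder." prefix.
    decoder_state = {k.removeprefix("decoder."): v
                     for k, v in state.items() if k.startswith("decoder.")}
    v2s_model.decoder.load_state_dict(decoder_state)
```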
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- SoundStorm: Efficient Parallel Audio Generation [27.121920017380273]
We present SoundStorm, a model for efficient, non-autoregressive audio generation.
SoundStorm receives the semantic tokens of AudioLM as input, and relies on bidirectional attention and confidence-based parallel decoding.
We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments.
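Confidence-based parallel decoding follows the MaskGIT recipe: start from a fully masked sequence, predict all positions at once, and commit only the most confident predictions each round. The sketch below illustrates that general scheme; the model signature and the cosine masking schedule are assumptions, not SoundStorm's exact implementation.

```python
import math
import torch


def parallel_decode(model, cond_tokens, seq_len, mask_id, num_steps=8):
    # Start fully masked; every position is predicted in parallel.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        logits = model(tokens, cond_tokens)      # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = tokens == mask_id
        # Committed positions never get re-masked: give them infinite
        # confidence so they sort past every masked slot.
        conf = torch.where(masked, conf, torch.full_like(conf, float("inf")))
        # Cosine schedule: how many positions may stay masked this round.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))
        if keep_masked == 0:
            # Final round: fill in everything that is still masked.
            tokens = torch.where(masked, pred, tokens)
            break
        # Unmask all but the keep_masked least-confident positions.
        threshold = conf.sort(dim=-1).values[:, keep_masked - 1]
        unmask = masked & (conf > threshold.unsqueeze(-1))
        tokens = torch.where(unmask, pred, tokens)
    return tokens
```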
arXiv Detail & Related papers (2023-05-16T17:41:25Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- The AS-NU System for the M2VoC Challenge [49.12981125333458]
This paper describes the AS-NU systems for two tracks in the MultiSpeaker Multi-Style Voice Cloning Challenge (M2VoC).
The first track focuses on voice cloning from a small set of 100 target utterances, while the second track allows only 5 target utterances.
Due to the serious lack of data in the second track, we selected the speaker most similar to the target speaker from the training data of the TTS system, and used that speaker's utterances together with the given 5 target utterances to fine-tune our model; a sketch of this selection step follows.
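A minimal sketch of that selection step, assuming an off-the-shelf speaker-embedding extractor (embed, e.g. an x-vector model) and cosine similarity; the AS-NU paper's actual similarity measure may differ.

```python
import numpy as np


def nearest_speaker(target_utts, train_utts_by_speaker, embed):
    """Pick the training speaker whose embedding is closest to the target."""
    # Average (then normalize) the embeddings of the few target utterances.
    target = np.mean([embed(u) for u in target_utts], axis=0)
    target /= np.linalg.norm(target)
    best_spk, best_sim = None, -1.0
    for spk, utts in train_utts_by_speaker.items():
        centroid = np.mean([embed(u) for u in utts], axis=0)
        centroid /= np.linalg.norm(centroid)
        sim = float(np.dot(target, centroid))  # cosine similarity
        if sim > best_sim:
            best_spk, best_sim = spk, sim
    # Fine-tune on this speaker's utterances plus the given targets.
    return best_spk
```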
arXiv Detail & Related papers (2021-04-07T09:26:20Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model and then re-synthesize the transcription with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
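The cascade is simple enough to sketch end to end. The snippet below assumes the espnet2 inference wrappers (Speech2Text, Text2Speech) with placeholder model tags, not the actual VCC 2020 baseline checkpoints.

```python
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

# 1) Transcribe the source speech with a pretrained ASR model.
asr = Speech2Text.from_pretrained("espnet/asr_model_tag")  # placeholder tag
speech, rate = sf.read("source_utterance.wav")
text, *_ = asr(speech)[0]  # best hypothesis from the n-best list

# 2) Re-synthesize the transcript with a target-speaker TTS model.
tts = Text2Speech.from_pretrained("espnet/tts_model_tag")  # placeholder tag
converted = tts(text)["wav"]
sf.write("converted_utterance.wav", converted.numpy(), tts.fs)  # tts.fs: assumed model sample-rate attribute
```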
arXiv Detail & Related papers (2020-10-06T02:27:38Z)