The IMS Toucan System for the Blizzard Challenge 2023
- URL: http://arxiv.org/abs/2310.17499v1
- Date: Thu, 26 Oct 2023 15:53:29 GMT
- Title: The IMS Toucan System for the Blizzard Challenge 2023
- Authors: Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler,
Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu
- Abstract summary: For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021.
Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language.
A GAN-based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave.
- Score: 25.460791056978895
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: For our contribution to the Blizzard Challenge 2023, we improved on the
system we submitted to the Blizzard Challenge 2021. Our approach entails a
rule-based text-to-phoneme processing system that includes rule-based
disambiguation of homographs in the French language. It then transforms the
phonemes to spectrograms as intermediate representations using a fast and
efficient non-autoregressive synthesis architecture based on Conformer and
Glow. A GAN-based neural vocoder that combines recent state-of-the-art
approaches converts the spectrogram to the final wave. We carefully designed
the data processing, training, and inference procedures for the challenge data.
Our system identifier is G. Open source code and demo are available.
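As a rough illustration of the pipeline the abstract describes, the sketch below wires the three stages together in plain Python. Every name here is a placeholder invented for illustration, not the real API; the actual implementation is the authors' open-source IMS Toucan toolkit.

```python
# Minimal sketch of the three-stage pipeline described above. All names are
# illustrative placeholders, not the real IMS Toucan API.

def text_to_phonemes(text: str) -> list:
    """Rule-based text-to-phoneme front end; the real system also applies
    rule-based homograph disambiguation for French."""
    lexicon = {"bonjour": ["b", "o", "Z", "u", "R"]}  # toy lexicon
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, list(word)))
    return phonemes

def phonemes_to_spectrogram(phonemes: list) -> list:
    """Stand-in for the non-autoregressive Conformer/Glow acoustic model:
    predicts a mel spectrogram (here: one dummy 80-bin frame per phoneme)."""
    return [[0.0] * 80 for _ in phonemes]

def vocoder(spectrogram: list) -> list:
    """Stand-in for the GAN-based neural vocoder that turns the mel
    spectrogram into the final waveform (here: one dummy sample per frame)."""
    return [0.0 for _ in spectrogram]

wave = vocoder(phonemes_to_spectrogram(text_to_phonemes("bonjour")))
print(f"synthesized {len(wave)} samples")
```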
Related papers
- Autoregressive Large Language Models are Computationally Universal [59.34397993748194]
We show that autoregressive decoding of a transformer-based language model can realize universal computation.
We first show that a universal Turing machine can be simulated by a Lag system with 2027 production rules.
We conclude that, by the Church-Turing thesis, prompted gemini-1.5-pro-001 with extended autoregressive (greedy) decoding is a general purpose computer.
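To make the lag-system mechanism concrete, here is a toy simulator of one commonly used convention: read the first two tape symbols to select a production, append it to the end, and delete one leading symbol. The four rules below are invented for the demo and have nothing to do with the paper's 2027-rule construction.

```python
# Toy lag-system simulator: each step reads the first two tape symbols,
# appends the production they select, and deletes one leading symbol.
# The rules are made up for illustration only.

def lag_step(tape, rules):
    production = rules[tape[:2]]  # rule chosen by the first two symbols
    return tape[1:] + production  # consume one symbol, append the production

rules = {"ab": "b", "bb": "ab", "ba": "", "aa": "a"}
tape = "abab"
for _ in range(6):
    print(tape)
    tape = lag_step(tape, rules)
```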
arXiv Detail & Related papers (2024-10-04T06:05:17Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
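As a generic sketch of pulling paired text and speech representations into a joint space, here is an InfoNCE-style contrastive loss over paired embeddings. It illustrates the family of objectives involved, not VQ-CTAP's exact loss or architecture.

```python
# Generic frame-level contrastive (InfoNCE-style) objective between text and
# speech embeddings in a shared space. Illustration only, not the paper's loss.
import numpy as np

def info_nce(text_emb: np.ndarray, speech_emb: np.ndarray, tau: float = 0.1):
    """text_emb, speech_emb: (N, D) L2-normalized; row i of each is a pair."""
    sim = text_emb @ speech_emb.T / tau          # (N, N) similarity matrix
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # pull matched pairs together

rng = np.random.default_rng(0)
t = rng.normal(size=(8, 16)); t /= np.linalg.norm(t, axis=1, keepdims=True)
s = rng.normal(size=(8, 16)); s /= np.linalg.norm(s, axis=1, keepdims=True)
print(f"contrastive loss: {info_nce(t, s):.3f}")
```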
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023 [51.95161901441527]
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions.
Deep features extracted from foundation models are used as robust acoustic and visual representations of raw video.
Our final system achieves state-of-the-art performance and ranks third on the leaderboard on MER-MULTI sub-challenge.
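A minimal late-fusion sketch of the idea: concatenate acoustic and visual features, then jointly decode a discrete emotion class and a continuous valence score. The dimensions, heads, and random weights are assumptions for illustration, not the authors' design.

```python
# Late fusion of acoustic and visual features with joint discrete/dimensional
# decoding. All shapes and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(1, 768))      # e.g. features from a speech model
visual = rng.normal(size=(1, 512))        # e.g. features from a vision model
fused = np.concatenate([acoustic, visual], axis=1)

W_cls = rng.normal(size=(1280, 6))        # discrete head: 6 emotion classes
W_reg = rng.normal(size=(1280, 1))        # dimensional head: valence in [-1, 1]
logits = fused @ W_cls
valence = np.tanh(fused @ W_reg)
print("class:", logits.argmax(), "valence:", valence.item())
```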
arXiv Detail & Related papers (2023-09-11T03:19:10Z)
- The FruitShell French synthesis system at the Blizzard 2023 Challenge [12.459890525109646]
This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023.
The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals.
arXiv Detail & Related papers (2023-09-01T02:56:20Z)
- Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content.
We propose a diffusion model based system for Foley sound generation with text conditions.
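For intuition about the generation side, below is a toy DDPM-style reverse (denoising) loop over a latent with a text-conditioning placeholder. The linear schedule and the stand-in denoiser are assumptions; in the real system a trained network predicts the noise and a decoder turns the final latent into the Foley waveform.

```python
# Toy DDPM-style ancestral sampling over a latent, conditioned on a text
# embedding. Schedule and denoiser are illustrative stand-ins only.
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(z, t, text_emb):
    """Stand-in for the conditional noise-prediction network."""
    return 0.1 * z + 0.01 * text_emb  # a real model is a trained U-Net/Transformer

z = rng.normal(size=(16,))            # latent
text_emb = rng.normal(size=(16,))     # text-condition embedding
for t in reversed(range(T)):
    eps = denoiser(z, t, text_emb)
    z = (z - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        z += np.sqrt(betas[t]) * rng.normal(size=z.shape)
print("denoised latent norm:", np.linalg.norm(z))
```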
arXiv Detail & Related papers (2023-06-17T14:16:24Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
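"Multiple code generation" is commonly realized with residual vector quantization, where each codebook quantizes the residual left by the previous one. The sketch below shows that generic mechanism; it is not necessarily the paper's exact quantizer.

```python
# Generic residual vector quantization: several codebooks each quantize the
# residual of the previous stage, yielding multiple codes per frame.
import numpy as np

def residual_vq(x, codebooks):
    """x: (D,) frame; codebooks: list of (K, D) arrays. Returns code indices."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        codes.append(int(idx))
        residual = residual - cb[idx]
    return codes

rng = np.random.default_rng(0)
frame = rng.normal(size=4)
books = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 codebooks of 8 entries
print("codes:", residual_vq(frame, books))
```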
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- The ReprGesture entry to the GENEA Challenge 2022 [8.081712389287903]
This paper describes the ReprGesture entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) challenge 2022.
The GENEA challenge provides the processed datasets and performs crowdsourced evaluations to compare the performance of different gesture generation systems.
arXiv Detail & Related papers (2022-08-25T14:50:50Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
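Schematically, such auxiliary tasks add extra sequence-prediction losses on top of the main objective. The sketch below mixes a main cross-entropy with one auxiliary cross-entropy under an assumed 0.5 weighting; the targets and weighting are illustrative, not the paper's exact setup.

```python
# Multi-task training signal: main sequence loss plus an auxiliary loss that
# tracks intermediate state. Targets and weighting are illustrative.
import numpy as np

def cross_entropy(logits, target):
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target)), target].mean()

rng = np.random.default_rng(0)
main_logits = rng.normal(size=(5, 10))   # main task: 5 steps, 10-way vocab
aux_logits = rng.normal(size=(5, 4))     # auxiliary task: tracks 4 states
main_tgt = rng.integers(0, 10, size=5)
aux_tgt = rng.integers(0, 4, size=5)

loss = cross_entropy(main_logits, main_tgt) + 0.5 * cross_entropy(aux_logits, aux_tgt)
print(f"total loss: {loss:.3f}")
```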
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge [2.675158177232256]
This paper describes the joint effort of BUT and Telefónica Research on the development of automatic speech recognition systems.
We compare approaches based on either hybrid or end-to-end models.
A fusion of our best systems achieved 23.33% WER in the official Albayzin 2020 evaluations.
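For reference, word error rate (WER) is the word-level edit distance between hypothesis and reference divided by the reference length; a minimal implementation:

```python
# Word error rate as normalized word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sit"))  # 1 substitution over 3 words ~ 0.33
```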
arXiv Detail & Related papers (2021-01-29T18:40:54Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then re-synthesize the transcription with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
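The naive cascade reduces to two stages, sketched here with placeholder models. The baseline implements both stages with ESPnet seq2seq models; the function bodies and the speaker label below are dummies for illustration.

```python
# The ASR -> TTS cascade for voice conversion, with placeholder stages.

def asr(source_wave) -> str:
    """Placeholder for a trained seq2seq ASR model."""
    return "transcribed text"

def tts(text: str, target_speaker: str):
    """Placeholder for a trained seq2seq TTS model with speaker control."""
    return [0.0] * 16000  # dummy one-second waveform

def voice_conversion(source_wave, target_speaker: str):
    return tts(asr(source_wave), target_speaker)

converted = voice_conversion(source_wave=[0.0] * 16000, target_speaker="SPK1")
print(len(converted), "samples")
```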
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge [27.314082075933197]
The ZeroSpeech 2020 challenge is to build a speech synthesizer without any textual information or phonetic labels.
We build a system that must address two major components: 1) given speech audio, extract subword units in an unsupervised way, and 2) re-synthesize the audio for novel speakers.
Our main contribution is a Transformer-based VQ-VAE for unsupervised unit discovery and a Transformer-based inverter that synthesizes speech from the extracted codebook.
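In isolation, the unit-discovery core of a VQ-VAE snaps each encoder frame to its nearest codebook vector, and the sequence of selected indices serves as the discovered units. A minimal sketch with illustrative dimensions:

```python
# Nearest-codebook quantization: the heart of VQ-VAE unit discovery.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))         # 32 discrete units, dim 8
frames = rng.normal(size=(6, 8))            # 6 encoder output frames

dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
units = dists.argmin(axis=1)                # one discrete unit per frame
quantized = codebook[units]                 # input to the synthesis inverter
print("discovered units:", units.tolist())
```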
arXiv Detail & Related papers (2020-05-24T07:42:43Z)