FOOCTTS: Generating Arabic Speech with Acoustic Environment for Football
Commentator
- URL: http://arxiv.org/abs/2306.07936v1
- Date: Wed, 7 Jun 2023 12:33:02 GMT
- Title: FOOCTTS: Generating Arabic Speech with Acoustic Environment for Football
Commentator
- Authors: Massa Baali, Ahmed Ali
- Abstract summary: The application takes text from the user, applies text pre-processing such as vowelization, and then passes it to the commentator's speech synthesizer.
Our pipeline includes Arabic automatic speech recognition for data labeling, CTC segmentation, transcription vowelization to match the speech, and fine-tuning of the TTS.
- Score: 8.89134799076718
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents FOOCTTS, an automatic pipeline for a football commentator
that generates speech with background crowd noise. The application takes the
text from the user, applies text pre-processing such as vowelization, and then
passes it to the commentator's speech synthesizer. Our pipeline includes Arabic
automatic speech recognition for data labeling, CTC segmentation, transcription
vowelization to match the speech, and fine-tuning of the TTS. Our system can
generate speech with its acoustic environment from only 15 minutes of football
commentator recordings. Our prototype is generalizable and can easily be
applied to different domains and languages.
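The pipeline described in the abstract can be sketched as a chain of stages. This is a minimal illustrative sketch, not the paper's published code: the helper names (`vowelize`, `synthesize`, `fooctts_pipeline`) and their stub bodies are assumptions, standing in for the real Arabic vowelizer and the commentator TTS fine-tuned on ASR-labeled, CTC-segmented audio.

```python
# Hypothetical sketch of a FOOCTTS-style inference pipeline.
# Stage names follow the abstract; implementations are stand-ins.

def vowelize(text: str) -> str:
    # Placeholder: real Arabic vowelization (tashkeel) restores diacritics
    # so the TTS input matches the vowelized training transcripts.
    return text  # identity stand-in

def synthesize(text: str, add_crowd_noise: bool = True) -> dict:
    # Stand-in for the fine-tuned commentator TTS, which the paper says
    # generates speech together with its acoustic environment (crowd noise).
    return {
        "text": vowelize(text),
        "acoustic_env": "crowd" if add_crowd_noise else "clean",
    }

def fooctts_pipeline(user_text: str) -> dict:
    # 1) pre-process the user's text (vowelization),
    # 2) run it through the speaker's synthesizer.
    return synthesize(user_text)

print(fooctts_pipeline("هدف رائع")["acoustic_env"])  # crowd
```

The point of the sketch is only the ordering: vowelization happens before synthesis, so the input distribution matches the vowelized transcripts used during fine-tuning.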
Related papers
- Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase any biases towards specific groups, such as people with a particular accent.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Direct Text to Speech Translation System using Acoustic Units [12.36988942647101]
This paper proposes a direct text to speech translation system using discrete acoustic units.
This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language.
Results show a remarkable improvement when initialising our proposed architecture with a model pre-trained with more languages.
arXiv Detail & Related papers (2023-09-14T07:35:14Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- A Textless Metric for Speech-to-Speech Comparison [20.658229254191266]
We introduce a new and simple method for comparing speech utterances without relying on text transcripts.
Our speech-to-speech comparison metric utilizes state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units.
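Once utterances are encoded into discrete acoustic units, comparing them reduces to comparing symbol sequences. The sketch below is an assumption about one way such a metric could work, not the paper's actual formulation: it takes unit sequences as already given (the HuBERT speech2unit encoding step is not shown) and scores them by normalized edit distance.

```python
# Sketch: compare two utterances via their discrete acoustic unit sequences.
# The unit sequences are assumed to come from a speech2unit encoder
# (e.g. HuBERT + k-means), which is outside this snippet.

def edit_distance(a, b):
    # Levenshtein distance over unit sequences, single-row DP.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def unit_similarity(units_a, units_b):
    # 1.0 for identical unit sequences, lower as they diverge.
    d = edit_distance(units_a, units_b)
    return 1.0 - d / max(len(units_a), len(units_b), 1)

print(unit_similarity([5, 12, 12, 7], [5, 12, 7]))  # 0.75
```

Because the comparison operates on units rather than transcripts, it needs no ASR or text supervision, which is the "textless" property the entry describes.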
arXiv Detail & Related papers (2022-10-21T09:28:54Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- Audio Adversarial Examples: Attacks Using Vocal Masks [0.0]
We construct audio adversarial examples against automatic speech-to-text systems.
We produce the adversarial audio by overlaying a vocal mask generated from the original audio.
We apply our audio adversarial attack to five state-of-the-art STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere, and CMUSphinx.
arXiv Detail & Related papers (2021-02-04T05:21:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.