RECOApy: Data recording, pre-processing and phonetic transcription for
end-to-end speech-based applications
- URL: http://arxiv.org/abs/2009.05493v2
- Date: Tue, 15 Sep 2020 09:07:33 GMT
- Title: RECOApy: Data recording, pre-processing and phonetic transcription for
end-to-end speech-based applications
- Authors: Adriana Stan
- Abstract summary: RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications.
The tool implements an easy-to-use interface for prompted speech recording, spectrogram and waveform analysis, utterance-level normalisation and silence trimming.
The grapheme-to-phoneme (G2P) converters are deep neural network (DNN) based architectures trained on lexicons extracted from the Wiktionary online collaborative resource.
- Score: 4.619541348328938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning enables the development of efficient end-to-end speech
processing applications while bypassing the need for expert linguistic and
signal processing features. Yet, recent studies show that good quality speech
resources and phonetic transcription of the training data can enhance the
results of these applications. In this paper, the RECOApy tool is introduced.
RECOApy streamlines the steps of data recording and pre-processing required in
end-to-end speech-based applications. The tool implements an easy-to-use
interface for prompted speech recording, spectrogram and waveform analysis,
utterance-level normalisation and silence trimming, as well as grapheme-to-phoneme
conversion of the prompts in eight languages: Czech, English, French, German,
Italian, Polish, Romanian and Spanish.
The grapheme-to-phoneme (G2P) converters are deep neural network (DNN) based
architectures trained on lexicons extracted from the Wiktionary online
collaborative resource. Because the languages differ in their degree of
orthographic transparency and in the number of available phonetic entries, the
DNN's hyperparameters are optimised with an evolution strategy. The phoneme and
word error rates of the resulting G2P converters are presented and discussed.
The tool, the processed phonetic lexicons and trained G2P models are made
freely available.
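As an illustration of the two metrics named in the abstract, the following is a minimal Python sketch of how phoneme and word error rates are commonly computed for G2P output. It assumes the usual convention (phoneme error rate as total edit distance over total reference phonemes, word error rate as the fraction of words with any phoneme error); the abstract does not state the paper's exact definitions, and the function names and toy phoneme symbols below are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference symbol
                            curr[j - 1] + 1,      # insertion of a hypothesis symbol
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Total phoneme edits divided by total reference phonemes (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phones = sum(len(r) for r in refs)
    return total_edits / total_phones


def word_error_rate(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Fraction of words whose predicted transcription is not an exact match."""
    wrong = sum(1 for r, h in zip(refs, hyps) if list(r) != list(h))
    return wrong / len(refs)


# Toy data, invented for illustration only (not taken from the paper's lexicons).
refs = [["k", "a", "t"], ["d", "o", "g"]]
hyps = [["k", "a", "t"], ["d", "o", "k"]]
per = phoneme_error_rate(refs, hyps)
wer = word_error_rate(refs, hyps)
```

In this toy case one of six phonemes is wrong, while one of two words contains an error, which shows why word-level rates for G2P are typically much higher than phoneme-level rates.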
Related papers
- LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study [2.8948274245812327]
Grapheme-to-phoneme (G2P) conversion is critical in speech processing.
Large language models (LLMs) have recently demonstrated significant potential in various language tasks.
We present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language.
arXiv Detail & Related papers (2024-09-13T06:13:55Z)
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce the Generative Pre-trained Speech Transformer (GPST).
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings [12.669655363646257]
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
arXiv Detail & Related papers (2023-07-31T13:25:38Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are transcribed.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
arXiv Detail & Related papers (2022-07-27T01:14:59Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.