ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus
- URL: http://arxiv.org/abs/2307.16071v2
- Date: Wed, 27 Mar 2024 08:56:01 GMT
- Title: ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus
- Authors: Tolulope Ogunremi, Kola Tubosun, Anuoluwapo Aremu, Iroro Orife, David Ifeoluwa Adelani
- Abstract summary: ÌròyìnSpeech is a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data.
We curated about 23000 text sentences from news and creative writing domains with the open license CC-BY-4.0.
- Score: 7.97238074132292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce ÌròyìnSpeech, a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data, which can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. We curated about 23000 text sentences from news and creative writing domains with the open license CC-BY-4.0. To encourage a participatory approach to data creation, we provide 5000 curated sentences to the Mozilla Common Voice platform to crowd-source the recording and validation of Yorùbá speech data. In total, we created about 42 hours of speech data recorded by 80 volunteers in-house, and 6 hours of validated recordings on the Mozilla Common Voice platform. Our TTS evaluation suggests that a high-fidelity, general domain, single-speaker Yorùbá voice is possible with as little as 5 hours of speech. Similarly, for ASR we obtained a baseline word error rate (WER) of 23.8.
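The abstract reports ASR quality as a word error rate (WER) of 23.8. For reference, the minimal sketch below shows how WER is conventionally computed (word-level edit distance divided by the number of reference words); the example strings are invented placeholders, not transcripts from the corpus.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Placeholder transcripts (one word dropped by the recognizer).
print(f"WER: {100 * wer('bawo ni o se wa', 'bawo ni se wa'):.1f}%")  # -> WER: 20.0%
```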
Related papers
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
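The EARS abstract does not list the instrumental metrics it uses; scale-invariant SDR (SI-SDR) is one metric commonly reported for speech enhancement, sketched below on random placeholder signals rather than EARS audio.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (both signals are zero-meaned first)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                   # 1 s of "clean" speech at 16 kHz (placeholder)
enhanced = clean + 0.1 * rng.standard_normal(16000)  # a slightly noisy "enhanced" estimate
print(f"SI-SDR: {si_sdr(clean, enhanced):.1f} dB")   # roughly 20 dB at this noise level
```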
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks [61.3055230762097]
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.
VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning.
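To illustrate the multitask setup described above, the sketch below lays out single-stream training examples that mix text tokens with discrete speech tokens via task-specific special tokens. The token names and ordering are assumptions for illustration, not VoxtLM's actual vocabulary or formatting.

```python
# Illustrative sketch only: the special tokens below are assumed names, not VoxtLM's actual vocabulary.
def build_sequence(task: str, speech_tokens=(), text_tokens=()):
    """Lay out one training example as a single token stream for a decoder-only LM."""
    if task == "asr":        # speech in, text out
        return ["<asr>", *speech_tokens, "<sep>", *text_tokens, "<eos>"]
    if task == "tts":        # text in, speech out
        return ["<tts>", *text_tokens, "<sep>", *speech_tokens, "<eos>"]
    if task == "textlm":     # text continuation
        return ["<textlm>", *text_tokens, "<eos>"]
    if task == "speechlm":   # speech continuation
        return ["<speechlm>", *speech_tokens, "<eos>"]
    raise ValueError(f"unknown task: {task}")

# Discrete speech units (e.g. indices from a self-supervised quantiser) and text tokens are placeholders.
print(build_sequence("asr", speech_tokens=["u12", "u87", "u33"], text_tokens=["hello", "world"]))
```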
arXiv Detail & Related papers (2023-09-14T03:13:18Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- IMaSC -- ICFOSS Malayalam Speech Corpus [0.0]
We present IMaSC, a Malayalam text and speech corpus containing approximately 50 hours of recorded speech.
With 8 speakers and a total of 34,473 text-audio pairs, IMaSC is larger than every other publicly available alternative.
We show that our models perform significantly better in terms of naturalness compared to previous studies and publicly available models, with an average mean opinion score of 4.50.
arXiv Detail & Related papers (2022-11-23T09:21:01Z)
- Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR [39.59611707268663]
We show that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised speech for some languages.
We show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap.
arXiv Detail & Related papers (2022-10-18T17:50:31Z)
- RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis [3.6406488220483317]
RyanSpeech is a new speech corpus for research on automated text-to-speech (TTS) systems.
It contains over 10 hours of a professional male voice actor's speech recorded at 44.1 kHz.
arXiv Detail & Related papers (2021-06-15T22:24:38Z)
- GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio [88.20960848885575]
GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training.
Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles.
For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h.
arXiv Detail & Related papers (2021-06-13T04:09:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.