NeMo Toolbox for Speech Dataset Construction
- URL: http://arxiv.org/abs/2104.04896v1
- Date: Sun, 11 Apr 2021 01:57:55 GMT
- Title: NeMo Toolbox for Speech Dataset Construction
- Authors: Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg
- Abstract summary: We develop tools for each step of the speech dataset construction pipeline including data preprocessing, audio-text alignment, data post-processing and filtering.
We demonstrated the toolbox efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks.
- Score: 11.494290433050624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a new toolbox for constructing speech datasets
from long audio recordings and raw reference texts. We develop tools for each
step of the speech dataset construction pipeline including data preprocessing,
audio-text alignment, data post-processing and filtering. The proposed pipeline
also supports human-in-the-loop to address text-audio mismatch issues and
remove samples that don't satisfy the quality requirements. We demonstrated the
toolbox efficiency by building the Russian LibriSpeech corpus (RuLS) from
LibriVox audiobooks. The toolbox is open-sourced in the NeMo framework. The RuLS
corpus is released on OpenSLR.
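The post-processing and filtering stage described above can be sketched in a few lines. This is a hypothetical illustration, not the toolbox's actual API: it assumes each aligned segment carries its reference text plus an ASR transcript of the clip, and drops segments whose character error rate (CER) exceeds a threshold.

```python
# Minimal sketch of quality filtering for aligned speech segments.
# Hypothetical names and threshold; the NeMo toolbox's real interface differs.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

def filter_segments(segments, max_cer=0.3):
    """Keep (reference_text, asr_transcript) pairs with acceptable CER."""
    return [s for s in segments if cer(s[0], s[1]) <= max_cer]

segments = [
    ("hello world", "hello world"),   # perfect match: kept
    ("hello world", "hxllo world"),   # minor ASR error: kept
    ("hello world", "goodbye moon"),  # text-audio mismatch: dropped
]
print(filter_segments(segments))
```

The same pattern extends naturally to the human-in-the-loop step the abstract mentions: segments falling between two CER thresholds can be routed to manual review instead of being discarded outright.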
Related papers
- A Large-scale Dataset for Audio-Language Representation Learning [54.933479346870506]
We present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs.
We construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.9M audio-text pairs.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information [21.864200803678003]
This work establishes the MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers.
To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed.
In particular, Microsoft's Azure Kinect, a recent data acquisition device, is used to capture depth information in addition to the traditional audio signals and RGB images.
arXiv Detail & Related papers (2023-06-04T05:00:12Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - textless-lib: a Library for Textless Spoken Language Processing [50.070693765984075]
We introduce textless-lib, a PyTorch-based library aimed at facilitating research in textless spoken language processing.
We describe the building blocks that the library provides and demonstrate its usability.
arXiv Detail & Related papers (2022-02-15T12:39:42Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - HUI-Audio-Corpus-German: A high quality TTS dataset [0.0]
"HUI-Audio-Corpus-German" is a large, open-source dataset for TTS engines, created with a processing pipeline.
The pipeline produces high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.
arXiv Detail & Related papers (2021-06-11T10:59:09Z) - NeurST: Neural Speech Translation Toolkit [13.68036533544182]
NeurST is an open-source toolkit for neural speech translation developed by ByteDance AI Lab.
It mainly focuses on end-to-end speech translation and is easy to use, modify, and extend for advanced speech translation research and products.
arXiv Detail & Related papers (2020-12-18T02:33:58Z) - "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs the text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z) - ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project within the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.