NeMo Toolbox for Speech Dataset Construction
- URL: http://arxiv.org/abs/2104.04896v1
- Date: Sun, 11 Apr 2021 01:57:55 GMT
- Title: NeMo Toolbox for Speech Dataset Construction
- Authors: Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg
- Abstract summary: We develop tools for each step of the speech dataset construction pipeline including data preprocessing, audio-text alignment, data post-processing and filtering.
We demonstrate the toolbox's efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks.
- Score: 11.494290433050624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a new toolbox for constructing speech datasets
from long audio recordings and raw reference texts. We develop tools for each
step of the speech dataset construction pipeline, including data preprocessing,
audio-text alignment, data post-processing, and filtering. The proposed pipeline
also supports human-in-the-loop review to address text-audio mismatch issues and
to remove samples that do not satisfy the quality requirements. We demonstrate the
toolbox's efficiency by building the Russian LibriSpeech corpus (RuLS) from
LibriVox audiobooks. The toolbox is open-sourced in the NeMo framework, and the
RuLS corpus is released on OpenSLR.
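
The post-processing and filtering stage lends itself to a small illustration: after alignment, each candidate segment's reference text can be compared against an ASR transcript of the same audio span, and segments whose character error rate (CER) exceeds a threshold are dropped or routed to human review. The sketch below is a minimal, self-contained version of that idea, not NeMo's actual implementation; the Segment fields and the 0.2 threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float       # segment start time, seconds
    end: float         # segment end time, seconds
    reference: str     # aligned reference text for this span
    hypothesis: str    # ASR transcript of the same audio span

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis against its reference."""
    if not reference:
        return 1.0
    return edit_distance(reference, hypothesis) / len(reference)

def filter_segments(segments, max_cer=0.2):
    """Keep segments whose ASR transcript agrees with the aligned reference.

    max_cer is an assumed threshold; in practice it is tuned per corpus, and
    borderline segments can be routed to human review instead of dropped.
    """
    return [s for s in segments if cer(s.reference, s.hypothesis) <= max_cer]

# Example: the well-aligned segment is kept, the mismatched one is dropped.
segs = [Segment(0.0, 3.1, "привет мир", "привет мир"),
        Segment(3.1, 6.0, "добрый день", "до свидания")]
print(len(filter_segments(segs)))  # -> 1
```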
Related papers
- Spontaneous Informal Speech Dataset for Punctuation Restoration [0.8517406772939293]
We introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources.
Our filtering pipeline examines the quality of both speech audio and transcription text (a toy illustration of such a filter follows this entry).
We also carefully construct a challenging test set aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation.
arXiv Detail & Related papers (2024-09-17T14:43:14Z)
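
A toy version of a two-sided quality filter like the one SponSpeech describes: an energy check on the waveform and sanity checks on the transcript, with a sample kept only if both pass. All thresholds and checks here are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def audio_ok(waveform: np.ndarray, min_rms: float = 0.01) -> bool:
    """Crude audio check: reject near-silent clips (assumed RMS threshold)."""
    rms = float(np.sqrt(np.mean(np.square(waveform))))
    return rms >= min_rms

def text_ok(transcript: str, min_words: int = 3) -> bool:
    """Crude transcript check: non-empty, long enough, mostly alphabetic."""
    words = transcript.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in transcript)
    return alpha / len(transcript) > 0.8

def keep_sample(waveform: np.ndarray, transcript: str) -> bool:
    """A sample survives only if both the audio and the text pass."""
    return audio_ok(waveform) and text_ok(transcript)

# Example: a quiet tone with a clean transcript passes; silence does not.
t = np.linspace(0, 1, 16000)
tone = 0.1 * np.sin(2 * np.pi * 440 * t)
print(keep_sample(tone, "so what were you saying"))   # True
print(keep_sample(np.zeros(16000), "um"))             # False
```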
- MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information [21.864200803678003]
This work establishes MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers.
To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed.
In particular, Microsoft's latest data acquisition device, the Azure Kinect, is used to capture depth information in addition to the traditional audio signals and RGB images.
arXiv Detail & Related papers (2023-06-04T05:00:12Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically (a hedged sketch of such an LLM cleaning step follows this entry).
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
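
A hedged sketch of an LLM-assisted cleaning step in the spirit of WavCaps: each raw description is sent to a chat model that either rewrites it as a short caption or rejects it. The prompt, model name, and REJECT convention are illustrative assumptions (the paper describes a three-stage pipeline); this assumes the official openai Python client.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Rewrite the following raw audio description as one short, factual "
    "caption describing only the sound. If the text does not describe "
    "a sound at all, reply with exactly REJECT.\n\n{raw}"
)

def clean_description(raw: str, model: str = "gpt-3.5-turbo") -> str | None:
    """Return a cleaned caption, or None if the LLM rejects the description."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(raw=raw)}],
        temperature=0.0,
    )
    caption = resp.choices[0].message.content.strip()
    return None if caption == "REJECT" else caption

# Example (hypothetical raw description harvested online):
# clean_description("Best fireworks video EVER!!! explosions and crowd noise")
# -> "Fireworks explode while a crowd cheers." (model output will vary)
```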
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder (a toy sketch of this layout follows this entry).
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
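
The bridging layout SpeechUT describes (a speech encoder and a text decoder joined through a shared unit encoder) can be shown as a toy PyTorch module. All dimensions, layer counts, and module choices below are illustrative assumptions, not the paper's actual architecture or training objectives.

```python
import torch
import torch.nn as nn

class ToySpeechUT(nn.Module):
    """Toy bridge: speech features -> shared unit space -> text logits."""

    def __init__(self, feat_dim=80, d_model=256, vocab_size=1000, n_layers=2):
        super().__init__()
        # Speech encoder: project filterbank features into the model space.
        self.speech_proj = nn.Linear(feat_dim, d_model)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), n_layers)
        # Shared unit encoder: a common space for speech- and text-side units.
        self.unit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), n_layers)
        # Text decoder: attends over unit representations and predicts tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, prev_tokens):
        # (B, T, feat_dim) -> (B, T, d_model) in the shared unit space.
        units = self.unit_encoder(self.speech_encoder(self.speech_proj(speech_feats)))
        # Causal masking is omitted for brevity; real training would need it.
        dec = self.text_decoder(self.embed(prev_tokens), memory=units)
        return self.out(dec)

model = ToySpeechUT()
logits = model(torch.randn(2, 120, 80), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```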
- textless-lib: a Library for Textless Spoken Language Processing [50.070693765984075]
We introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area.
We describe the building blocks that the library provides and demonstrate its usability.
arXiv Detail & Related papers (2022-02-15T12:39:42Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder (a toy sketch of such a decoder follows this entry).
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
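
A toy sketch of a transformer-based mel-spectrogram decoder in the spirit of the entry above: previous mel frames are embedded, decoded against a text/speaker context with a causal mask, and mapped to next-frame and stop-token predictions. The prenet, dimensions, and stop head are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class ToyMelDecoder(nn.Module):
    """Toy transformer decoder that autoregressively predicts mel frames."""

    def __init__(self, n_mels=80, d_model=256, n_layers=3):
        super().__init__()
        self.prenet = nn.Linear(n_mels, d_model)   # embed the previous frame
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), n_layers)
        self.to_mel = nn.Linear(d_model, n_mels)   # next-frame regression head
        self.to_stop = nn.Linear(d_model, 1)       # end-of-speech logit

    def forward(self, prev_mels, context):
        """prev_mels: (B, T, n_mels); context: (B, S, d_model) text/speaker memory."""
        T = prev_mels.size(1)
        # Additive causal mask so frame t only attends to frames <= t.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.prenet(prev_mels), memory=context, tgt_mask=causal)
        return self.to_mel(h), self.to_stop(h)

dec = ToyMelDecoder()
mels, stops = dec(torch.zeros(1, 10, 80), torch.randn(1, 7, 256))
print(mels.shape, stops.shape)  # torch.Size([1, 10, 80]) torch.Size([1, 10, 1])
```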
- NeurST: Neural Speech Translation Toolkit [13.68036533544182]
NeurST is an open-source toolkit for neural speech translation developed by ByteDance AI Lab.
It mainly focuses on end-to-end speech translation and is easy to use, modify, and extend for advanced speech translation research and products.
arXiv Detail & Related papers (2020-12-18T02:33:58Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.