NeMo Toolbox for Speech Dataset Construction
- URL: http://arxiv.org/abs/2104.04896v1
- Date: Sun, 11 Apr 2021 01:57:55 GMT
- Title: NeMo Toolbox for Speech Dataset Construction
- Authors: Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg
- Abstract summary: We develop tools for each step of the speech dataset construction pipeline including data preprocessing, audio-text alignment, data post-processing and filtering.
We demonstrated the toolbox efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks.
- Score: 11.494290433050624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a new toolbox for constructing speech datasets
from long audio recordings and raw reference texts. We develop tools for each
step of the speech dataset construction pipeline including data preprocessing,
audio-text alignment, data post-processing and filtering. The proposed pipeline
also supports human-in-the-loop to address text-audio mismatch issues and
remove samples that don't satisfy the quality requirements. We demonstrated the
toolbox efficiency by building the Russian LibriSpeech corpus (RuLS) from
LibriVox audiobooks. The toolbox is open-sourced in the NeMo framework. The RuLS
corpus is released on OpenSLR.
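The post-processing and filtering stage described above can be sketched in a few lines. This is a hypothetical illustration, not the toolbox's actual API: it assumes each aligned segment carries its reference text plus an ASR transcript of the clip, and drops segments whose character error rate (CER) exceeds a threshold.

```python
# Minimal sketch of quality filtering for aligned speech segments.
# Hypothetical names and threshold; the NeMo toolbox's real interface differs.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

def filter_segments(segments, max_cer=0.3):
    """Keep (reference_text, asr_transcript) pairs with acceptable CER."""
    return [s for s in segments if cer(s[0], s[1]) <= max_cer]

segments = [
    ("hello world", "hello world"),   # perfect match: kept
    ("hello world", "hxllo world"),   # minor ASR error: kept
    ("hello world", "goodbye moon"),  # text-audio mismatch: dropped
]
print(filter_segments(segments))
```

The same pattern extends naturally to the human-in-the-loop step the abstract mentions: segments falling between two CER thresholds can be routed to manual review instead of being discarded outright.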
Related papers
- A Large-scale Dataset for Audio-Language Representation Learning [54.933479346870506]
We present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs.
We construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.9M audio-text pairs.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information [21.864200803678003]
This work establishes the MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers.
To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed.
In particular, Microsoft's Azure Kinect, a recent data acquisition device, is used to capture depth information in addition to the traditional audio signals and RGB images.
arXiv Detail & Related papers (2023-06-04T05:00:12Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - textless-lib: a Library for Textless Spoken Language Processing [50.070693765984075]
We introduce textless-lib, a PyTorch-based library aimed at facilitating research in textless spoken language processing.
We describe the building blocks that the library provides and demonstrate its usability.
arXiv Detail & Related papers (2022-02-15T12:39:42Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - HUI-Audio-Corpus-German: A high quality TTS dataset [0.0]
"HUI-Audio-Corpus-German" is a large, open-source dataset for TTS engines, created with a processing pipeline.
The pipeline produces high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.
arXiv Detail & Related papers (2021-06-11T10:59:09Z) - NeurST: Neural Speech Translation Toolkit [13.68036533544182]
NeurST is an open-source toolkit for neural speech translation developed by ByteDance AI Lab.
It mainly focuses on end-to-end speech translation and is easy to use, modify, and extend for advanced speech translation research and products.
arXiv Detail & Related papers (2020-12-18T02:33:58Z) - "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs the text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z) - ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project within the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.