ESPnet-SE++: Speech Enhancement for Robust Speech Recognition,
Translation, and Understanding
- URL: http://arxiv.org/abs/2207.09514v1
- Date: Tue, 19 Jul 2022 18:55:29 GMT
- Title: ESPnet-SE++: Speech Enhancement for Robust Speech Recognition,
Translation, and Understanding
- Authors: Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell,
Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu
Tsao, Yanmin Qian, Shinji Watanabe
- Abstract summary: This paper presents recent progress on integrating speech separation and enhancement into the ESPnet toolkit.
A new interface has been designed to combine speech enhancement front-ends with other tasks, including automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU).
Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR.
- Score: 86.47555696652618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents recent progress on integrating speech separation and
enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE
work, numerous features have been added, including recent state-of-the-art
speech enhancement models with their respective training and evaluation
recipes. Importantly, a new interface has been designed to flexibly combine
speech enhancement front-ends with other tasks, including automatic speech
recognition (ASR), speech translation (ST), and spoken language understanding
(SLU). To showcase such integration, we performed experiments on carefully
designed synthetic datasets for noisy-reverberant multi-channel ST and SLU
tasks, which can be used as benchmark corpora for future research. In addition
to these new tasks, we also use CHiME-4 and WSJ0-2Mix to benchmark multi- and
single-channel SE approaches. Results show that the integration of SE
front-ends with back-end tasks is a promising research direction even for tasks
besides ASR, especially in the multi-channel scenario. The code is available
online at https://github.com/ESPnet/ESPnet. The multi-channel ST and SLU
datasets, which are another contribution of this work, are released on
HuggingFace.
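As a rough illustration of the front-end/back-end combination described above, here is a minimal Python sketch assuming the espnet2 inference interfaces (SeparateSpeech for enhancement, Speech2Text for ASR); all file paths are placeholders, and the actual ESPnet-SE++ joint interface is configured through recipes rather than a script like this.

```python
import soundfile as sf
from espnet2.bin.enh_inference import SeparateSpeech
from espnet2.bin.asr_inference import Speech2Text

# Load a noisy single-channel mixture; the path is a placeholder.
mixture, fs = sf.read("noisy_mixture.wav")

# SE front-end: paths point at a trained enhancement model (placeholders).
enhance = SeparateSpeech(
    train_config="exp/enh_train/config.yaml",
    model_file="exp/enh_train/valid.loss.best.pth",
    normalize_output_wav=True,
)
# Returns one waveform per separated source, each shaped (1, samples).
enhanced = enhance(mixture[None, :], fs=fs)

# Back-end task: ASR here, but ST and SLU back-ends follow the same pattern.
speech2text = Speech2Text(
    asr_train_config="exp/asr_train/config.yaml",
    asr_model_file="exp/asr_train/valid.acc.best.pth",
)
for wav in enhanced:
    text, *_ = speech2text(wav[0])[0]  # best hypothesis for this source
    print(text)
```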
Related papers
- Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data [30.966072545451183]
We propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM.
We develop an effective data construction approach that splits and concatenates words from different languages to equip the LLM with code-switching (CS) synthesis ability without relying on CS data.
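The split-and-concatenate idea can be pictured with a toy sketch; this is a purely hypothetical simplification for illustration, not the authors' actual pipeline:

```python
import random

def make_code_switched(en_sentence: str, zh_sentence: str, switch_prob: float = 0.3) -> str:
    """Toy code-switched (CS) text synthesis: randomly splice units from a
    parallel Chinese sentence into an English one. Hypothetical illustration
    only; the paper's construction operates on multilingual training data."""
    en_words = en_sentence.split()
    zh_units = list(zh_sentence)  # treat each Chinese character as a unit
    out = []
    for word in en_words:
        if zh_units and random.random() < switch_prob:
            out.append(zh_units.pop(0))  # switch language at this position
        else:
            out.append(word)
    return " ".join(out)

print(make_code_switched("i would like some coffee please", "我想要一杯咖啡"))
```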
arXiv Detail & Related papers (2024-09-17T08:11:07Z)
- Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect [11.013934239276036]
Speech encoders pretrained through self-supervised learning (SSL) have demonstrated remarkable performance in various downstream tasks.
This paper contributes by comparing the effectiveness of SSL approaches in the context of the low-resource spoken Tunisian Arabic dialect.
arXiv Detail & Related papers (2024-07-05T14:21:36Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
However, there are not nearly as many SLU benchmarks as for other speech tasks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
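The idea of splitting message passing by context source can be pictured with a toy PyTorch block; this is a hypothetical simplification, not the paper's iGNN architecture:

```python
import torch
import torch.nn as nn

class SplitMessagePassing(nn.Module):
    """Toy block: temporal context (neighboring frames of the same speaker)
    and cross-speaker context are aggregated by separate linear maps and
    then fused. Hypothetical illustration only."""

    def __init__(self, dim: int):
        super().__init__()
        self.temporal = nn.Linear(dim, dim)
        self.cross = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (speakers, time, dim)
        t_msg = self.temporal(0.5 * (x.roll(1, dims=1) + x.roll(-1, dims=1)))
        c_msg = self.cross(x.mean(dim=0, keepdim=True).expand_as(x))
        return torch.relu(self.fuse(torch.cat([t_msg, c_msg], dim=-1)))

feats = torch.randn(3, 10, 64)  # 3 candidate speakers, 10 frames, 64-dim
print(SplitMessagePassing(64)(feats).shape)  # torch.Size([3, 10, 64])
```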
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet [95.39817519115394]
ESPnet-SLU is a project inside the end-to-end speech processing toolkit ESPnet.
It is designed for quick development of spoken language understanding in a single framework.
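A minimal usage sketch, assuming the espnet_model_zoo downloader and the ASR-style espnet2 inference interface that ESPnet-SLU builds on; the model name and audio path are placeholders:

```python
import soundfile as sf
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

# Download a pretrained model; the name below is a placeholder, not a real tag.
d = ModelDownloader()
speech2text = Speech2Text(**d.download_and_unpack("espnet/slu_model_placeholder"))

speech, fs = sf.read("utterance.wav")
text, *_ = speech2text(speech)[0]
# In ESPnet-SLU, semantic labels are decoded as part of the output sequence,
# e.g. an intent token followed by the transcript.
print(text)
```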
arXiv Detail & Related papers (2021-11-29T17:05:49Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
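A minimal sketch of extracting WavLM representations for a downstream task, assuming the Hugging Face Transformers port; "microsoft/wavlm-base-plus" is one published checkpoint, not necessarily the exact configuration described in the paper:

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

name = "microsoft/wavlm-base-plus"  # one public checkpoint; an assumption here
extractor = AutoFeatureExtractor.from_pretrained(name)
model = WavLMModel.from_pretrained(name)

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, hidden_dim)
print(hidden.shape)  # frame-level features for a downstream head
```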
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
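A minimal decoding sketch, assuming the espnet2 ST inference interface (the ESPnet-ST recipes themselves drive data preparation, training, and decoding through shell scripts); all paths are placeholders:

```python
import soundfile as sf
from espnet2.bin.st_inference import Speech2Text

# Placeholder paths to a trained speech translation model.
speech2text = Speech2Text(
    st_train_config="exp/st_train/config.yaml",
    st_model_file="exp/st_train/valid.acc.best.pth",
)

speech, fs = sf.read("source_language_utterance.wav")
translation, *_ = speech2text(speech)[0]  # best translation hypothesis
print(translation)
```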
arXiv Detail & Related papers (2020-04-21T18:38:38Z)