Speech enhancement aided end-to-end multi-task learning for voice activity detection
- URL: http://arxiv.org/abs/2010.12484v3
- Date: Tue, 13 Apr 2021 08:03:25 GMT
- Title: Speech enhancement aided end-to-end multi-task learning for voice activity detection
- Authors: Xu Tan, Xiao-Lei Zhang
- Abstract summary: Speech enhancement is helpful to voice activity detection (VAD), but the performance improvement is limited.
We propose a speech enhancement aided end-to-end multi-task model for VAD.
mSI-SDR uses VAD information to mask the output of the speech enhancement decoder in the training process.
- Score: 40.44466027163059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robust voice activity detection (VAD) is a challenging task in low signal-to-noise ratio (SNR) environments. Recent studies show that speech enhancement
is helpful to VAD, but the performance improvement is limited. To address this
issue, here we propose a speech enhancement aided end-to-end multi-task model
for VAD. The model has two decoders, one for speech enhancement and the other
for VAD. The two decoders share the same encoder and speech separation network.
Rather than simply using two separate objectives for VAD and speech enhancement, we propose a new joint optimization
objective -- VAD-masked scale-invariant source-to-distortion ratio (mSI-SDR).
mSI-SDR uses VAD information to mask the output of the speech enhancement decoder during training, so the VAD and speech enhancement tasks are jointly optimized not only through the shared encoder and separation network but also at the objective level. The model also theoretically satisfies real-time processing requirements. Experimental results show that the multi-task method
significantly outperforms its single-task VAD counterpart. Moreover, mSI-SDR
outperforms SI-SDR in the same multi-task setting.
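The abstract only sketches mSI-SDR at a high level. As a rough illustration of the idea, the PyTorch sketch below computes a standard SI-SDR and gates the enhancement decoder's waveform output with upsampled VAD posteriors before the ratio is taken, so a single objective back-propagates into both decoders and the shared encoder and separation network. Function and variable names (`masked_si_sdr`, `vad_posteriors`) are illustrative and not from the paper, and details such as how the clean reference is treated are my own assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a VAD-masked SI-SDR (mSI-SDR-style) training objective.
# Assumptions: PyTorch, frame-level VAD posteriors upsampled to sample
# resolution by nearest-neighbour interpolation, unmasked clean reference.
import torch
import torch.nn.functional as F


def si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant source-to-distortion ratio in dB; shapes (batch, samples)."""
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target   # projection of the estimate onto the target
    e_noise = estimate - s_target      # residual distortion
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)


def masked_si_sdr(enhanced: torch.Tensor,
                  clean: torch.Tensor,
                  vad_posteriors: torch.Tensor) -> torch.Tensor:
    """SI-SDR in which the VAD decoder's frame posteriors mask the enhancement
    decoder's output, coupling the two tasks in a single objective."""
    # Upsample frame posteriors (batch, frames) to sample resolution (batch, samples).
    mask = F.interpolate(vad_posteriors.unsqueeze(1),
                         size=enhanced.shape[-1], mode="nearest").squeeze(1)
    return si_sdr(mask * enhanced, clean)


if __name__ == "__main__":
    # Toy usage: maximise mSI-SDR by minimising its negative mean.
    enhanced = torch.randn(2, 16000, requires_grad=True)   # enhancement decoder output
    clean = torch.randn(2, 16000)                          # clean reference speech
    vad = torch.sigmoid(torch.randn(2, 100))               # 100 frames of VAD posteriors
    loss = -masked_si_sdr(enhanced, clean, vad).mean()
    loss.backward()
    print(float(loss))
```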
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
- Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains [0.0]
VAD and OSD can be trained jointly using a multi-class classification model.
This paper proposes a complete and new benchmark of different VAD and OSD models.
Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results.
arXiv Detail & Related papers (2023-07-24T14:29:21Z)
- Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method by applying the text-based speech editing model.
The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
- Encoder-decoder multimodal speaker change detection [15.290910973040152]
Speaker change detection (SCD) is essential for several applications.
Multimodal SCD models, which utilise the text modality in addition to audio, have shown improved performance.
This study builds upon two main proposals: a novel mechanism for modality fusion and the adoption of an encoder-decoder architecture.
arXiv Detail & Related papers (2023-06-01T13:55:23Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Long-Running Speech Recognizer: An End-to-End Multi-Task Learning Framework for Online ASR and VAD [10.168591454648123]
This paper presents a novel end-to-end (E2E), multi-task learning (MTL) framework that integrates ASR and VAD into one model.
The proposed system, which we refer to as Long-Running Speech Recognizer (LR-SR), learns ASR and VAD jointly from two separate task-specific datasets in the training stage.
In the inference stage, the LR-SR system removes non-speech parts at low computational cost and recognizes speech parts with high robustness.
arXiv Detail & Related papers (2021-03-02T11:49:03Z)
- Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization [73.62550438861942]
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
arXiv Detail & Related papers (2020-10-30T20:26:28Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Discriminative Multi-modality Speech Recognition [17.296404414250553]
Vision is often used as a complementary modality for audio speech recognition (ASR).
In this paper, we propose a two-stage speech recognition model.
In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, making the model 'listen' clearly.
At the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate.
arXiv Detail & Related papers (2020-05-12T07:56:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.