End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice
Activity Detection
- URL: http://arxiv.org/abs/2002.00551v2
- Date: Fri, 14 Feb 2020 06:15:58 GMT
- Title: End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice
Activity Detection
- Authors: Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda and Shinji Watanabe
- Abstract summary: This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension to hybrid CTC/attention architectures.
We use the labels as a cue for detecting speech segments with simple thresholding.
- Score: 48.80449801938696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper integrates a voice activity detection (VAD) function with
end-to-end automatic speech recognition, targeting online speech interfaces
and the transcription of very long audio recordings. We focus on connectionist
temporal classification (CTC) and its extension to hybrid CTC/attention
architectures. As
opposed to an attention-based architecture, input-synchronous label prediction
can be performed based on a greedy search with the CTC (pre-)softmax output.
This prediction includes consecutive long blank labels, which can be regarded
as a non-speech region. We use the labels as a cue for detecting speech
segments with simple thresholding. The threshold value is directly related to
the length of a non-speech region, which is more intuitive and easier to
control than conventional VAD hyperparameters. Experimental results on
unsegmented data show that the proposed method outperformed baselines that use
conventional energy-based and neural-network-based VAD, achieving a real-time
factor (RTF) below 0.2. The proposed method is publicly available.
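The blank-run thresholding described in the abstract is simple enough to illustrate. Below is a minimal sketch, assuming frame-level CTC posteriors are already available from an acoustic encoder; the function name, blank index, and default threshold are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def detect_speech_segments(ctc_posteriors, blank_id=0, min_blank_run=16):
    """Split an utterance into speech segments based on CTC blank runs.

    ctc_posteriors : (T, V) array of frame-level CTC (pre-)softmax scores.
    blank_id       : index of the CTC blank label (assumed here to be 0).
    min_blank_run  : length, in frames, at which a consecutive blank run
                     is treated as a non-speech region.
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    # Input-synchronous greedy search: one label per input frame.
    labels = ctc_posteriors.argmax(axis=1)
    is_blank = labels == blank_id

    segments, start, blank_run = [], None, 0
    for t, blank in enumerate(is_blank):
        if blank:
            blank_run += 1
            # A long enough blank run closes the current speech segment
            # at the frame where the run began.
            if start is not None and blank_run >= min_blank_run:
                segments.append((start, t - blank_run + 1))
                start = None
        else:
            if start is None:
                start = t  # first non-blank frame opens a segment
            blank_run = 0
    if start is not None:
        # Close a trailing segment, trimming any short final blank run.
        segments.append((start, len(labels) - blank_run))
    return segments
```

This mirrors why the abstract calls the threshold intuitive: with, say, a 10 ms frame shift (or 40 ms per CTC frame under 4x encoder subsampling), min_blank_run converts directly into a minimum non-speech duration, unlike typical energy-based VAD hyperparameters.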
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experimental results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z)
- Multistream neural architectures for cued-speech recognition using a pre-trained visual feature extractor and constrained CTC decoding [0.0]
Cued Speech (CS) is a visual communication tool that helps people with hearing impairment to understand spoken language.
The proposed approach is based on a pre-trained hand and lips tracker used for visual feature extraction and a phonetic decoder based on a multistream recurrent neural network.
With a decoding accuracy at the phonetic level of 70.88%, the proposed system outperforms our previous CNN-HMM decoder and competes with more complex baselines.
arXiv Detail & Related papers (2022-04-11T09:30:08Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network.
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording [46.69852287267763]
We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches.
We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states.
Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one.
arXiv Detail & Related papers (2021-07-15T17:59:10Z)
- Sequential End-to-End Intent and Slot Label Classification and Localization [2.1684857243537334]
End-to-end (e2e) spoken language understanding (SLU) solutions have recently been proposed to decrease latency.
We propose a compact e2e SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values.
Results show our model's ability to process the incoming speech signal, reaching accuracy as high as 98.97% for CTC and 98.78% for CTL on single-label classification.
arXiv Detail & Related papers (2021-06-08T19:53:04Z)
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding, and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)