Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR
- URL: http://arxiv.org/abs/2102.03055v1
- Date: Fri, 5 Feb 2021 08:36:58 GMT
- Title: Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR
- Authors: Ruizhi Li and Gregory Sell and Hynek Hermansky
- Abstract summary: In a multi-stream paradigm, improving robustness entails handling a variety of unseen single-stream conditions as well as inter-stream dynamics.
We introduce a two-stage augmentation scheme focusing on mismatch scenarios.
Compared with the previous training strategy, substantial improvements are reported with relative word error rate reductions of 29.7-59.3%.
- Score: 35.7018440502825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance degradation of an Automatic Speech Recognition (ASR) system is
commonly observed when the test acoustic condition is different from training.
Hence, it is essential to make ASR systems robust against various environmental
distortions, such as background noise and reverberation. In a multi-stream
paradigm, improving robustness entails handling a variety of unseen
single-stream conditions as well as inter-stream dynamics. Previously, a practical
two-stage training strategy was proposed within multi-stream end-to-end ASR,
where Stage-2 formulates the multi-stream model with features from Stage-1
Universal Feature Extractor (UFE). In this paper, as an extension, we introduce
a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1
Augmentation aims to address single-stream input varieties with data
augmentation techniques; Stage-2 Time Masking applies temporal masks on UFE
features of randomly selected streams to simulate diverse stream combinations.
During inference, we also present adaptive Connectionist Temporal
Classification (CTC) fusion with the help of hierarchical attention mechanisms.
Experiments have been conducted on two datasets, DIRHA and AMI, as a
multi-stream scenario. Compared with the previous training strategy,
substantial improvements are reported with relative word error rate reductions
of 29.7-59.3% across several unseen stream combinations.
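The two ideas in the abstract — Stage-2 Time Masking on UFE features of randomly selected streams, and attention-weighted CTC fusion at inference — can be illustrated with a minimal NumPy sketch. This is a hedged illustration, not the authors' implementation: function names, mask widths, and the log-linear combination rule are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage2_time_mask(ufe_feats, max_masked_streams=1, max_mask_width=40):
    """Illustrative Stage-2 Time Masking: zero out a temporal span in
    randomly selected streams to simulate diverse stream combinations.

    ufe_feats: list of (T, D) arrays, one per stream of UFE features.
    (Hyperparameters here are placeholders, not the paper's values.)
    """
    feats = [f.copy() for f in ufe_feats]
    n_mask = int(rng.integers(0, max_masked_streams + 1))
    for idx in rng.choice(len(feats), size=n_mask, replace=False):
        T = feats[idx].shape[0]
        width = int(rng.integers(1, max_mask_width + 1))
        start = int(rng.integers(0, max(T - width, 1)))
        feats[idx][start:start + width, :] = 0.0  # mask the selected span
    return feats

def adaptive_ctc_fusion(stream_ctc_logprobs, stream_weights):
    """Illustrative adaptive CTC fusion: combine per-stream CTC
    log-posteriors with stream-level weights (here assumed to come from
    a hierarchical attention mechanism).

    stream_ctc_logprobs: (S, T, V) log-posteriors per stream.
    stream_weights: (S,) non-negative stream weights.
    """
    w = stream_weights / stream_weights.sum()
    # weighted log-linear combination over the stream axis
    return np.einsum('s,stv->tv', w, stream_ctc_logprobs)

# toy example: 2 streams, 50 frames, 30-dim UFE features, 20-symbol vocab
feats = [rng.standard_normal((50, 30)) for _ in range(2)]
masked = stage2_time_mask(feats)
logp = np.log(rng.dirichlet(np.ones(20), size=(2, 50)))
fused = adaptive_ctc_fusion(logp, np.array([0.7, 0.3]))
print(fused.shape)  # (50, 20)
```

Masking whole temporal spans of individual streams, rather than frequency bins, mimics a stream dropping out or degrading, which is the mismatch scenario the paper targets.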
Related papers
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second place by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
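Fusing two modalities with cross-attention, as the MLCA-AVSR summary describes at a high level, can be sketched as follows. This is a simplified single-head illustration under assumed shapes, with no learned projections; it is not the paper's architecture.

```python
import numpy as np

def cross_attention(query, key_value, d_k):
    """Single-head cross-attention: `query` frames attend to `key_value`
    frames (keys and values shared here for simplicity)."""
    scores = query @ key_value.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key frames
    return weights @ key_value

rng = np.random.default_rng(0)
audio = rng.standard_normal((40, 64))   # (T_audio, D), illustrative shapes
visual = rng.standard_normal((25, 64))  # (T_visual, D)

# fuse at one encoder level: each modality attends to the other and the
# attended context is added back residually
audio_fused = audio + cross_attention(audio, visual, 64)
visual_fused = visual + cross_attention(visual, audio, 64)
print(audio_fused.shape, visual_fused.shape)  # (40, 64) (25, 64)
```

Repeating this exchange at several encoder depths gives the "different levels" fusion the summary refers to.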
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Online Boosting Adaptive Learning under Concept Drift for Multistream Classification [34.64751041290346]
Multistream classification poses significant challenges due to the necessity for rapid adaptation in dynamic streaming processes with concept drift.
We propose a novel Online Boosting Adaptive Learning (OBAL) method that adaptively learns the dynamic correlation among different streams.
arXiv Detail & Related papers (2023-12-17T23:10:39Z)
- Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures [12.703947839247693]
Diffusion models, emerging as powerful deep generative tools, excel in various applications.
However, their remarkable generative performance is hindered by slow training and sampling.
This is due to the necessity of tracking extensive forward and reverse diffusion trajectories.
We present a multi-stage framework inspired by our empirical findings to tackle these challenges.
arXiv Detail & Related papers (2023-12-14T17:48:09Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- PAT: Position-Aware Transformer for Dense Multi-Label Action Detection [36.39340228621982]
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video.
We embed relative positional encoding in the self-attention mechanism and exploit multi-scale temporal relationships.
We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets.
arXiv Detail & Related papers (2023-08-09T16:29:31Z)
- AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation.
It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations.
Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z) - Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel
Transformer [29.03463312813923]
Video denoising aims to recover high-quality frames from the noisy video.
Most existing approaches adopt convolutional neural networks(CNNs) to separate the noise from the original visual content.
We propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising.
arXiv Detail & Related papers (2022-04-30T09:01:21Z) - Efficient Two-Stream Network for Violence Detection Using Separable
Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model surpasses state-of-the-art accuracy on the larger and more challenging RWF-2000 dataset by a margin of more than 2%.
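The core building block mentioned above, replacing a full convolution with a depthwise separable one, can be sketched in NumPy. This is a generic illustration of the depthwise-then-pointwise factorization under assumed shapes, not the SepConvLSTM gate implementation itself.

```python
import numpy as np

def depthwise_separable_conv2d(x, dw_kernels, pw_weights):
    """Depthwise separable convolution: a per-channel 3x3 convolution
    followed by a 1x1 pointwise mixing of channels.

    x: (H, W, C) input; dw_kernels: (3, 3, C); pw_weights: (C, C_out).
    'same' padding, stride 1. Shapes are illustrative.
    """
    H, W, C = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    dw = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 3, j:j + 3, :]             # (3, 3, C) window
            dw[i, j] = (patch * dw_kernels).sum((0, 1))  # per-channel conv
    return dw @ pw_weights                               # 1x1 pointwise mix

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
out = depthwise_separable_conv2d(x, rng.standard_normal((3, 3, 4)),
                                 rng.standard_normal((4, 6)))
print(out.shape)  # (8, 8, 6)
```

The factorization cuts the parameter count from roughly K*K*C*C_out to K*K*C + C*C_out, which is why it yields an efficient ConvLSTM variant when applied at each gate.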
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
- SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction [72.37440317774556]
We propose advances that address two key challenges in future trajectory prediction: multimodality in both training data and predictions, and constant-time inference regardless of the number of agents.
arXiv Detail & Related papers (2020-07-26T08:17:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.