Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR
- URL: http://arxiv.org/abs/2102.03055v1
- Date: Fri, 5 Feb 2021 08:36:58 GMT
- Title: Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR
- Authors: Ruizhi Li and Gregory Sell and Hynek Hermansky
- Abstract summary: In a multi-stream paradigm, improving robustness entails handling a variety of unseen single-stream conditions as well as inter-stream dynamics.
We introduce a two-stage augmentation scheme focusing on mismatch scenarios.
Compared with the previous training strategy, substantial improvements are reported with relative word error rate reductions of 29.7-59.3%.
- Score: 35.7018440502825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance degradation of an Automatic Speech Recognition (ASR) system is
commonly observed when the test acoustic condition is different from training.
Hence, it is essential to make ASR systems robust against various environmental
distortions, such as background noise and reverberation. In a multi-stream
paradigm, improving robustness entails handling a variety of unseen
single-stream conditions as well as inter-stream dynamics. Previously, a practical
two-stage training strategy was proposed within multi-stream end-to-end ASR,
where Stage-2 formulates the multi-stream model with features from Stage-1
Universal Feature Extractor (UFE). In this paper, as an extension, we introduce
a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1
Augmentation aims to address single-stream input varieties with data
augmentation techniques; Stage-2 Time Masking applies temporal masks on UFE
features of randomly selected streams to simulate diverse stream combinations.
During inference, we also present adaptive Connectionist Temporal
Classification (CTC) fusion with the help of hierarchical attention mechanisms.
Experiments have been conducted on two datasets, DIRHA and AMI, as a
multi-stream scenario. Compared with the previous training strategy,
substantial improvements are reported with relative word error rate reductions
of 29.7-59.3% across several unseen stream combinations.
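The two ideas in the abstract — Stage-2 Time Masking on UFE features of randomly selected streams, and attention-weighted CTC fusion at inference — can be illustrated with a minimal NumPy sketch. This is a hedged illustration, not the authors' implementation: function names, mask widths, and the log-linear combination rule are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage2_time_mask(ufe_feats, max_masked_streams=1, max_mask_width=40):
    """Illustrative Stage-2 Time Masking: zero out a temporal span in
    randomly selected streams to simulate diverse stream combinations.

    ufe_feats: list of (T, D) arrays, one per stream of UFE features.
    (Hyperparameters here are placeholders, not the paper's values.)
    """
    feats = [f.copy() for f in ufe_feats]
    n_mask = int(rng.integers(0, max_masked_streams + 1))
    for idx in rng.choice(len(feats), size=n_mask, replace=False):
        T = feats[idx].shape[0]
        width = int(rng.integers(1, max_mask_width + 1))
        start = int(rng.integers(0, max(T - width, 1)))
        feats[idx][start:start + width, :] = 0.0  # mask the selected span
    return feats

def adaptive_ctc_fusion(stream_ctc_logprobs, stream_weights):
    """Illustrative adaptive CTC fusion: combine per-stream CTC
    log-posteriors with stream-level weights (here assumed to come from
    a hierarchical attention mechanism).

    stream_ctc_logprobs: (S, T, V) log-posteriors per stream.
    stream_weights: (S,) non-negative stream weights.
    """
    w = stream_weights / stream_weights.sum()
    # weighted log-linear combination over the stream axis
    return np.einsum('s,stv->tv', w, stream_ctc_logprobs)

# toy example: 2 streams, 50 frames, 30-dim UFE features, 20-symbol vocab
feats = [rng.standard_normal((50, 30)) for _ in range(2)]
masked = stage2_time_mask(feats)
logp = np.log(rng.dirichlet(np.ones(20), size=(2, 50)))
fused = adaptive_ctc_fusion(logp, np.array([0.7, 0.3]))
print(fused.shape)  # (50, 20)
```

Masking whole temporal spans of individual streams, rather than frequency bins, mimics a stream dropping out or degrading, which is the mismatch scenario the paper targets.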
Related papers
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second place by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
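Fusing two modalities with cross-attention, as the MLCA-AVSR summary describes at a high level, can be sketched as follows. This is a simplified single-head illustration under assumed shapes, with no learned projections; it is not the paper's architecture.

```python
import numpy as np

def cross_attention(query, key_value, d_k):
    """Single-head cross-attention: `query` frames attend to `key_value`
    frames (keys and values shared here for simplicity)."""
    scores = query @ key_value.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key frames
    return weights @ key_value

rng = np.random.default_rng(0)
audio = rng.standard_normal((40, 64))   # (T_audio, D), illustrative shapes
visual = rng.standard_normal((25, 64))  # (T_visual, D)

# fuse at one encoder level: each modality attends to the other and the
# attended context is added back residually
audio_fused = audio + cross_attention(audio, visual, 64)
visual_fused = visual + cross_attention(visual, audio, 64)
print(audio_fused.shape, visual_fused.shape)  # (40, 64) (25, 64)
```

Repeating this exchange at several encoder depths gives the "different levels" fusion the summary refers to.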
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Online Boosting Adaptive Learning under Concept Drift for Multistream Classification [34.64751041290346]
Multistream classification poses significant challenges due to the necessity for rapid adaptation in dynamic streaming processes with concept drift.
We propose a novel Online Boosting Adaptive Learning (OBAL) method that adaptively learns the dynamic correlation among different streams.
arXiv Detail & Related papers (2023-12-17T23:10:39Z)
- Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures [12.703947839247693]
Diffusion models, emerging as powerful deep generative tools, excel in various applications.
However, their remarkable generative performance is hindered by slow training and sampling.
This is due to the necessity of tracking extensive forward and reverse diffusion trajectories.
We present a multi-stage framework inspired by our empirical findings to tackle these challenges.
arXiv Detail & Related papers (2023-12-14T17:48:09Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- PAT: Position-Aware Transformer for Dense Multi-Label Action Detection [36.39340228621982]
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video.
We embed relative positional encoding in the self-attention mechanism and exploit multi-scale temporal relationships.
We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets.
arXiv Detail & Related papers (2023-08-09T16:29:31Z)
- AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation.
It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations.
Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z) - Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel
Transformer [29.03463312813923]
Video denoising aims to recover high-quality frames from the noisy video.
Most existing approaches adopt convolutional neural networks(CNNs) to separate the noise from the original visual content.
We propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising.
arXiv Detail & Related papers (2022-04-30T09:01:21Z) - Efficient Two-Stream Network for Violence Detection Using Separable
Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model surpasses state-of-the-art accuracy on the larger and more challenging RWF-2000 dataset by a margin of more than 2%.
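The core building block mentioned above, replacing a full convolution with a depthwise separable one, can be sketched in NumPy. This is a generic illustration of the depthwise-then-pointwise factorization under assumed shapes, not the SepConvLSTM gate implementation itself.

```python
import numpy as np

def depthwise_separable_conv2d(x, dw_kernels, pw_weights):
    """Depthwise separable convolution: a per-channel 3x3 convolution
    followed by a 1x1 pointwise mixing of channels.

    x: (H, W, C) input; dw_kernels: (3, 3, C); pw_weights: (C, C_out).
    'same' padding, stride 1. Shapes are illustrative.
    """
    H, W, C = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    dw = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 3, j:j + 3, :]             # (3, 3, C) window
            dw[i, j] = (patch * dw_kernels).sum((0, 1))  # per-channel conv
    return dw @ pw_weights                               # 1x1 pointwise mix

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
out = depthwise_separable_conv2d(x, rng.standard_normal((3, 3, 4)),
                                 rng.standard_normal((4, 6)))
print(out.shape)  # (8, 8, 6)
```

The factorization cuts the parameter count from roughly K*K*C*C_out to K*K*C + C*C_out, which is why it yields an efficient ConvLSTM variant when applied at each gate.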
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
- SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction [72.37440317774556]
We propose advances that address two key challenges in future trajectory prediction: multimodality in both training data and predictions, and constant-time inference regardless of the number of agents.
arXiv Detail & Related papers (2020-07-26T08:17:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.