Weak Alignment Supervision from Hybrid Model Improves End-to-end ASR
- URL: http://arxiv.org/abs/2311.14835v2
- Date: Thu, 30 Nov 2023 20:18:56 GMT
- Title: Weak Alignment Supervision from Hybrid Model Improves End-to-end ASR
- Authors: Jintao Jiang, Yingbo Gao, Zoltan Tuske
- Abstract summary: We create weak alignment supervision from an existing hybrid system to aid the end-to-end modeling of automatic speech recognition.
We then create a cross-entropy loss at a certain layer of the encoder using the derived alignments.
In contrast to the general one-hot cross-entropy losses, here we use a cross-entropy loss with a label smoothing parameter to regularize the supervision.
- Score: 5.2823268671093775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we aim to create weak alignment supervision from an existing
hybrid system to aid the end-to-end modeling of automatic speech recognition.
Towards this end, we use the existing hybrid ASR system to produce triphone
alignments of the training audio. We then create a cross-entropy loss at a
certain layer of the encoder using the derived alignments. In contrast to the
general one-hot cross-entropy losses, here we use a cross-entropy loss with a
label smoothing parameter to regularize the supervision. As a comparison, we
also conduct the experiments with one-hot cross-entropy losses and CTC losses
with loss weighting. The results show that placing the weak alignment
supervision with the label smoothing parameter of 0.5 at the third encoder
layer outperforms the other two approaches and leads to about 5% relative WER
reduction on the TED-LIUM 2 dataset over the baseline. We see similar
improvements when applying the method out-of-the-box on a Tagalog end-to-end
ASR system.
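As a rough illustration of the label-smoothed cross-entropy described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the common smoothing convention in which a fraction `smoothing` of the target mass is spread uniformly over all classes; the abstract does not state which convention the paper uses. The function names, the per-frame logits layout, and the toy class count are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_frame_ce(frame_logits, alignment, smoothing=0.5):
    """Mean label-smoothed cross-entropy over frames.

    frame_logits: one list of class logits per frame, e.g. from a
                  projection on an intermediate encoder layer (assumed).
    alignment:    one class index per frame, e.g. a triphone-state id
                  taken from a hybrid-system alignment (assumed).
    smoothing:    fraction of target mass spread uniformly over classes;
                  0.0 recovers the one-hot cross-entropy.
    """
    if len(frame_logits) != len(alignment):
        raise ValueError("need exactly one alignment label per frame")
    total = 0.0
    for logits, label in zip(frame_logits, alignment):
        log_probs = log_softmax(logits)
        uniform = smoothing / len(log_probs)
        for i, lp in enumerate(log_probs):
            # Aligned class keeps (1 - smoothing) plus its uniform share.
            target = uniform + ((1.0 - smoothing) if i == label else 0.0)
            total -= target * lp
    return total / len(alignment)


# Toy example: 2 frames, 3 classes, aligned to classes 0 and 2.
toy_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
toy_alignment = [0, 2]
one_hot_loss = smoothed_frame_ce(toy_logits, toy_alignment, smoothing=0.0)
weak_loss = smoothed_frame_ce(toy_logits, toy_alignment, smoothing=0.5)
```

With `smoothing=0.5`, a confident prediction that agrees with the alignment still incurs a noticeable loss, so the alignment acts as a softer constraint than a one-hot target would.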
Related papers
- Alternating Weak Triphone/BPE Alignment Supervision from Hybrid Model
Improves End-to-End ASR [9.24160000451216]
Alternating weak triphone/BPE alignment supervision is proposed to improve end-to-end model training.
We show that either triphone or BPE alignment based weak supervision improves ASR performance over standard CTC auxiliary loss.
arXiv Detail & Related papers (2024-02-23T20:26:54Z)
- RomniStereo: Recurrent Omnidirectional Stereo Matching [6.153793254880079]
We propose a recurrent omnidirectional stereo matching (RomniStereo) algorithm.
Our best model improves the average MAE metric by 40.7% over the previous SOTA baseline.
When visualizing the results, our models demonstrate clear advantages on both synthetic and realistic examples.
arXiv Detail & Related papers (2024-01-09T04:06:01Z)
- Gait Cycle Reconstruction and Human Identification from Occluded
Sequences [2.198430261120653]
We propose an effective neural network-based model to reconstruct the occluded frames in an input sequence before carrying out gait recognition.
We employ LSTM networks to predict an embedding for each occluded frame both from the forward and the backward directions.
While the LSTMs are trained to minimize the mean-squared loss, the fusion network is trained to optimize the pixel-wise cross-entropy loss between the ground-truth and the reconstructed samples.
arXiv Detail & Related papers (2022-06-20T16:04:31Z)
- The KFIoU Loss for Rotated Object Detection [115.334070064346]
In this paper, we argue that one effective alternative is to devise an approximate loss that can achieve trend-level alignment with the SkewIoU loss.
Specifically, we model the objects as Gaussian distributions and adopt a Kalman filter to inherently mimic the mechanism of SkewIoU.
The resulting new loss called KFIoU is easier to implement and works better compared with exact SkewIoU.
arXiv Detail & Related papers (2022-01-29T10:54:57Z)
- Label Distributionally Robust Losses for Multi-class Classification:
Consistency, Robustness and Adaptivity [55.29408396918968]
We study a family of loss functions named label-distributionally robust (LDR) losses for multi-class classification.
Our contributions include both consistency and robustness by establishing top-$k$ consistency of LDR losses for multi-class classification.
We propose a new adaptive LDR loss that automatically adapts the individualized temperature parameter to the degree of noise in the class label of each instance.
arXiv Detail & Related papers (2021-12-30T00:27:30Z)
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- Finite-time System Identification and Adaptive Control in Autoregressive
Exogenous Systems [79.67879934935661]
We study the problem of system identification and adaptive control of unknown ARX systems.
We provide finite-time learning guarantees for the ARX systems under both open-loop and closed-loop data collection.
arXiv Detail & Related papers (2021-08-26T18:00:00Z)
- Class Interference Regularization [7.248447600071719]
Contrastive losses yield state-of-the-art performance for person re-identification, face verification and few-shot learning.
We propose a novel, simple and effective regularization technique, the Class Interference Regularization (CIR).
CIR perturbs the output features by randomly moving them towards the average embeddings of the negative classes.
arXiv Detail & Related papers (2020-09-04T21:03:32Z)
- SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector.
It forms a single shot anchor-based detector (SADet) for efficient and accurate pedestrian detection.
Though structurally simple, it presents state-of-the-art results and a real-time speed of 20 FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)
- AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching [50.06646151004375]
A novel domain-adaptive pipeline called AdaStereo aims to align multi-level representations for deep stereo matching networks.
Our AdaStereo models achieve state-of-the-art cross-domain performance on multiple stereo benchmarks, including KITTI, Middlebury, ETH3D, and DrivingStereo.
arXiv Detail & Related papers (2020-04-09T16:15:13Z)