Alternating Weak Triphone/BPE Alignment Supervision from Hybrid Model
Improves End-to-End ASR
- URL: http://arxiv.org/abs/2402.15594v1
- Date: Fri, 23 Feb 2024 20:26:54 GMT
- Title: Alternating Weak Triphone/BPE Alignment Supervision from Hybrid Model
Improves End-to-End ASR
- Authors: Jintao Jiang, Yingbo Gao, Mohammad Zeineldeen, Zoltan Tuske
- Abstract summary: Alternating weak triphone/BPE alignment supervision is proposed to improve end-to-end model training.
We show that either triphone- or BPE-alignment-based weak supervision improves ASR performance over a standard CTC auxiliary loss.
- Score: 9.24160000451216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, alternating weak triphone/BPE alignment supervision is
proposed to improve end-to-end model training. Towards this end, triphone and
BPE alignments are extracted using a pre-existing hybrid ASR system. A
regularization effect is then obtained from cross-entropy-based intermediate
auxiliary losses computed on these alignments, at a mid-layer representation of
the encoder for the triphone alignments and at the encoder for the BPE
alignments. Weak supervision is achieved through strong label smoothing with a
smoothing parameter of 0.5. Experimental results on TED-LIUM 2 indicate that
either triphone- or BPE-alignment-based weak supervision improves ASR
performance over a standard CTC auxiliary loss. Moreover, their combination
lowers the word error rate further. We also investigate alternating the two
auxiliary tasks during model training and observe an additional performance
gain. Overall, the proposed techniques yield a relative error rate reduction of
over 10% over a CTC-regularized baseline system.
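As a concrete illustration, here is a minimal PyTorch sketch of the alternating auxiliary supervision described in the abstract. The head modules, the auxiliary weight, and per-step alternation are illustrative assumptions, not the paper's exact configuration; only the label smoothing value of 0.5 and the placement (mid-layer for triphones, encoder for BPE) come from the abstract.

```python
import torch.nn as nn

# Weak supervision via strong label smoothing (0.5, per the abstract).
tri_ce = nn.CrossEntropyLoss(label_smoothing=0.5)
bpe_ce = nn.CrossEntropyLoss(label_smoothing=0.5)

def training_loss(step, enc_mid, enc_out, tri_head, bpe_head,
                  tri_align, bpe_align, main_loss, aux_weight=0.3):
    """enc_mid: (B, T, D) mid-layer encoder states; enc_out: (B, T, D)
    final encoder states; tri_align / bpe_align: (B, T) frame-level
    alignment labels extracted from the hybrid ASR system."""
    if step % 2 == 0:  # alternate the two auxiliary tasks across updates
        logits = tri_head(enc_mid)              # (B, T, n_triphone_states)
        aux = tri_ce(logits.transpose(1, 2), tri_align)
    else:
        logits = bpe_head(enc_out)              # (B, T, n_bpe_units)
        aux = bpe_ce(logits.transpose(1, 2), bpe_align)
    # main_loss stands in for the model's primary objective
    # (e.g., the CTC-regularized loss of the baseline system).
    return main_loss + aux_weight * aux
```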
Related papers
- Joint Unsupervised and Supervised Training for Automatic Speech Recognition via Bilevel Optimization [73.98386682604122]
We present a novel bilevel optimization-based approach to training acoustic models for automatic speech recognition (ASR) tasks that we term bi-level joint unsupervised and supervised training (BL-JUST).
BL-JUST employs lower- and upper-level optimization with an unsupervised loss and a supervised loss, respectively, leveraging recent advances in penalty-based bilevel optimization to solve this challenging ASR problem with affordable complexity and rigorous convergence guarantees.
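A minimal sketch of one penalty-based bilevel update in the spirit of this summary; the penalty weight rho and the two loss callables are illustrative placeholders, not the paper's algorithm.

```python
# Penalty reformulation: the lower-level (unsupervised) objective enters
# the upper-level (supervised) objective as a weighted penalty term.
def bl_just_step(optimizer, supervised_loss_fn, unsupervised_loss_fn, rho=1.0):
    optimizer.zero_grad()
    loss = supervised_loss_fn() + rho * unsupervised_loss_fn()
    loss.backward()
    optimizer.step()
    return float(loss)
```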
arXiv Detail & Related papers (2024-01-13T05:01:47Z)
- Weak Alignment Supervision from Hybrid Model Improves End-to-end ASR [5.2823268671093775]
We create weak alignment supervision from an existing hybrid system to aid the end-to-end modeling of automatic speech recognition.
We then create a cross-entropy loss at a certain layer of the encoder using the derived alignments.
In contrast to the standard one-hot cross-entropy loss, we use a cross-entropy loss with a label smoothing parameter to regularize the supervision.
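For illustration, this is what the smoothed target distribution looks like under one common label smoothing convention (mass eps spread over the non-target classes); frameworks may distribute eps slightly differently.

```python
def smoothed_target(label, num_classes, eps=0.5):
    q = [eps / (num_classes - 1)] * num_classes  # spread eps over others
    q[label] = 1.0 - eps                         # remaining mass on target
    return q

# Example: smoothed_target(2, 4) -> [0.1667, 0.1667, 0.5, 0.1667]
```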
arXiv Detail & Related papers (2023-11-24T20:14:28Z)
- Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer [16.716653844774374]
We evaluate the repeatability of embeddings using the intra-class correlation coefficient (ICC).
We propose a novel regularizer, the ICC regularizer, as a complementary component for contrastive losses to guide deep neural networks to produce embeddings with higher repeatability.
We implement the ICC regularizer and apply it to three speech tasks: speaker verification, voice style conversion, and a clinical application for detecting dysphonic voice.
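A hedged sketch of a one-way random-effects ICC estimate (ICC(1,1)) for a scalar embedding feature; the paper's exact estimator and its use inside a differentiable regularizer may differ.

```python
import numpy as np

def icc_oneway(x):
    """x: (n_subjects, k_repeats) array of repeated measurements."""
    n, k = x.shape
    grand = x.mean()
    ms_between = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```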
arXiv Detail & Related papers (2023-10-25T23:21:46Z)
- Deep Autoencoder-based Z-Interference Channels with Perfect and Imperfect CSI [14.04355073946466]
A deep autoencoder (DAE)-based structure for end-to-end communication over the two-user Z-interference channel (ZIC) with finite-alphabet inputs is designed in this paper.
The proposed structure jointly optimizes the two encoder/decoder pairs and generates interference-aware constellations that dynamically adapt their shape based on interference intensity to minimize the bit error rate (BER).
An in-phase/quadrature-phase (I/Q) power allocation layer is introduced in the DAE to guarantee an average power constraint and enable the architecture to generate constellations with nonuniform shapes.
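A hedged sketch of an average-power constraint like the I/Q power allocation described above: rescale learned constellation points so the batch-average power is 1, leaving their (possibly nonuniform) shape free. The layer structure is an assumption, not the paper's architecture.

```python
import torch

def power_normalize(iq):
    """iq: (batch, 2) tensor holding I and Q components."""
    mean_power = iq.pow(2).sum(dim=1).mean()
    return iq / torch.sqrt(mean_power + 1e-12)  # E[|x|^2] == 1 after scaling
```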
arXiv Detail & Related papers (2023-10-23T15:23:42Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
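A hedged sketch of the kind of linear-interpolation update the summary alludes to (a Lookahead-style averaging of iterates): slow weights move a fraction alpha toward the fast weights produced by an inner optimizer. Names and alpha are illustrative.

```python
def interpolate_(slow_params, fast_params, alpha=0.5):
    for s, f in zip(slow_params, fast_params):
        s.data.add_(alpha * (f.data - s.data))  # s <- (1 - alpha) s + alpha f
        f.data.copy_(s.data)                    # restart fast weights at s
```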
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Mitigating the Alignment Tax of RLHF [76.4300447532456]
Aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained abilities, also known as the alignment tax.
We propose model averaging, in the form of Heterogeneous Model Averaging (HMA), to maximize alignment performance while incurring a minimal alignment tax.
We validate HMA's performance across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B.
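A hedged sketch of weight-space model averaging between the pre-RLHF and RLHF-tuned checkpoints; HMA varies the mixing ratio across layers, which the per-key `ratios` mapping stands in for here. This is a generic illustration, not the paper's implementation.

```python
def average_models(pre_sd, rlhf_sd, ratios):
    """pre_sd / rlhf_sd: state dicts; ratios: per-parameter mixing weights."""
    return {k: (1.0 - ratios[k]) * pre_sd[k] + ratios[k] * rlhf_sd[k]
            for k in pre_sd}
```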
arXiv Detail & Related papers (2023-09-12T14:16:54Z)
- Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS).
A resource-efficient adaptation of a frozen pre-trained TTS model is developed using only 1.2% to 0.8% of the original trainable parameters.
Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
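A hedged sketch of one common parameter-efficient pattern consistent with this summary: freeze the pre-trained backbone and train only small residual bottleneck adapters. The class, names, and bottleneck size are illustrative assumptions, not the paper's modules.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

def freeze_backbone(model):
    for p in model.parameters():
        p.requires_grad = False  # only newly added adapters stay trainable
```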
arXiv Detail & Related papers (2023-05-18T22:02:59Z)
- ADC-Net: An Open-Source Deep Learning Network for Automated Dispersion Compensation in Optical Coherence Tomography [0.0]
This study develops a deep learning network for automated dispersion compensation (ADC-Net) in optical coherence tomography (OCT).
The ADC-Net is based on a redesigned UNet architecture which employs an encoder-decoder pipeline.
Two numeric parameters, i.e., peak signal-to-noise ratio (PSNR) and the structural similarity index metric computed at multiple scales (MS-SSIM), were used for objective assessment of the ADC-Net performance.
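For reference, a minimal implementation of the PSNR metric named above (MS-SSIM omitted for brevity; libraries such as scikit-image provide both).

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```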
arXiv Detail & Related papers (2022-01-29T17:23:46Z)
- The KFIoU Loss for Rotated Object Detection [115.334070064346]
In this paper, we argue that one effective alternative is to devise an approximate loss that can achieve trend-level alignment with the SkewIoU loss.
Specifically, we model the objects as Gaussian distributions and adopt a Kalman filter to inherently mimic the mechanism of SkewIoU.
The resulting new loss, called KFIoU, is easier to implement and works better than the exact SkewIoU.
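A hedged sketch of the Gaussian modeling step behind KFIoU-style losses: a rotated box (cx, cy, w, h, theta) becomes a 2-D Gaussian whose covariance encodes the box's size and orientation.

```python
import numpy as np

def box_to_gaussian(cx, cy, w, h, theta):
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([w * w / 4.0, h * h / 4.0])
    return np.array([cx, cy]), R @ S @ R.T  # mean, covariance
```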
arXiv Detail & Related papers (2022-01-29T10:54:57Z)
- Reconcile Prediction Consistency for Balanced Object Detection [10.61438063305309]
We propose a Harmonic loss to harmonize the optimization of the classification and localization branches.
The Harmonic loss enables these two branches to supervise and promote each other during training.
To prevent the localization loss from being dominated by outliers during the training phase, a Harmonic IoU loss is proposed to harmonize the weight of the localization loss across samples of different IoU levels.
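A generic, heavily hedged illustration of IoU-dependent re-weighting of a localization loss (one plausible reading of the summary, not the paper's exact Harmonic IoU formulation): samples are weighted by their IoU so low-IoU outliers cannot dominate the localization term.

```python
def iou_weighted_loc_loss(loc_losses, ious):
    weights = ious / (ious.sum() + 1e-9)  # normalize per-sample IoU weights
    return (weights * loc_losses).sum()
```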
arXiv Detail & Related papers (2021-08-24T15:52:11Z)
- Improving Stability of LS-GANs for Audio and Speech Signals [70.15099665710336]
We show that encoding departure from normality, computed in this vector space, into the generator's optimization formulation helps to craft more comprehensive spectrograms.
We demonstrate that incorporating this metric enhances training stability, with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z)