Bayes Risk Transducer: Transducer with Controllable Alignment Prediction
- URL: http://arxiv.org/abs/2308.10107v1
- Date: Sat, 19 Aug 2023 20:48:16 GMT
- Title: Bayes Risk Transducer: Transducer with Controllable Alignment Prediction
- Authors: Jinchuan Tian, Jianwei Yu, Hangting Chen, Brian Yan, Chao Weng, Dong
Yu, Shinji Watanabe
- Abstract summary: The Bayes Risk Transducer (BRT) is proposed to enforce preferred paths and achieve controllable alignment prediction.
BRT saves inference cost by up to 46% for non-streaming ASR and reduces overall system latency by 41% for streaming ASR.
- Score: 79.41540601816315
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Automatic speech recognition (ASR) based on transducers is widely used. In
training, a transducer maximizes the summed posteriors of all paths. The path
with the highest posterior is commonly defined as the predicted alignment
between the speech and the transcription. While the vanilla transducer does not
have a prior preference for any of the valid paths, this work intends to
enforce the preferred paths and achieve controllable alignment prediction.
Specifically, this work proposes Bayes Risk Transducer (BRT), which uses a
Bayes risk function to set lower risk values to the preferred paths so that the
predicted alignment is more likely to satisfy specific desired properties. We
further demonstrate that these predicted alignments with intentionally designed
properties can provide practical advantages over the vanilla transducer.
Experimentally, the proposed BRT saves inference cost by up to 46% for
non-streaming ASR and reduces overall system latency by 41% for streaming ASR.
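To make the objective concrete, the following is a minimal sketch in the spirit of the abstract, not the paper's exact formulation: it enumerates every monotonic transducer alignment for a toy utterance, computes the vanilla loss as the negative log of the summed path posteriors, and then down-weights each path by a hypothetical risk (here, the frame at which the last label is emitted) so that early-emitting alignments dominate the objective. The toy dimensions, the risk definition, and risk_scale are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): a toy transducer lattice where
# every alignment path is enumerated explicitly. The vanilla loss sums the
# posteriors of all paths; the BRT-style variant down-weights each path by a
# risk term so that preferred (here: early-emitting) alignments dominate.
import numpy as np

T, U, V = 3, 2, 4            # frames, transcription length, vocab size (incl. blank)
BLANK = 0
labels = [1, 2]              # hypothetical transcription token ids

rng = np.random.default_rng(0)
logits = rng.normal(size=(T, U + 1, V))
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # log-softmax

def paths(t=0, u=0):
    """Yield every monotonic alignment from node (t, u) to the terminal node."""
    if t == T - 1 and u == U:
        yield [(t, u, BLANK)]                    # final blank closes the path
        return
    if u < U:                                    # emit the next label, stay on frame t
        for rest in paths(t, u + 1):
            yield [(t, u, labels[u])] + rest
    if t < T - 1:                                # emit blank, advance to frame t + 1
        for rest in paths(t + 1, u):
            yield [(t, u, BLANK)] + rest

def path_logp(path):
    return sum(log_probs[t, u, v] for t, u, v in path)

def last_emit_frame(path):
    return max(t for t, u, v in path if v != BLANK)

all_paths = list(paths())
logps = np.array([path_logp(p) for p in all_paths])

# Vanilla transducer: maximize the summed posterior of all valid paths.
vanilla_loss = -np.logaddexp.reduce(logps)

# BRT-style: a hypothetical risk (lateness of the last label emission) is
# subtracted in log space, i.e. each path is weighted by exp(-risk_scale * risk).
risk_scale = 1.0
risks = np.array([last_emit_frame(p) for p in all_paths], dtype=float)
brt_loss = -np.logaddexp.reduce(logps - risk_scale * risks)

print(f"{len(all_paths)} paths  vanilla={vanilla_loss:.3f}  brt-style={brt_loss:.3f}")
```

A practical implementation would fold such a weighting into the transducer forward-backward recursion rather than enumerating paths explicitly, which is only feasible at toy sizes; the point of the sketch is simply that assigning lower risk to preferred paths shifts which alignment carries the most posterior mass.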
Related papers
- Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study [52.91899050612153]
This study examines transformers within pre-trained language models (PLMs) when repurposed as encoders for Automatic Speech Recognition (ASR).
Our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated.
This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems' capabilities.
arXiv Detail & Related papers (2024-09-26T11:31:18Z)
- Rolling Shutter Correction with Intermediate Distortion Flow Estimation [55.59359977619609]
This paper proposes to correct the rolling shutter (RS) distorted images by estimating the distortion flow from the global shutter (GS) to RS directly.
Existing methods usually perform correction using the undistortion flow from the RS to GS.
We introduce a new framework that directly estimates the distortion flow and rectifies the RS image with the backward warping operation.
arXiv Detail & Related papers (2024-04-09T14:40:54Z) - Algorithm for AGC index management against crowded radio environment [0.0]
This paper describes a receiver that uses an innovative method to predict, from the history of receiver operating metrics (packets lost or well received), the optimum automatic gain control (AGC) index, or most appropriate variable gain range, to use for the next packet reception, anticipating an interferer appearing during payload reception.
This allows the receiver to have higher immunity to interferers, even those occurring during the gain-frozen payload reception period, while still ensuring an optimum sensitivity level.
arXiv Detail & Related papers (2024-03-19T05:42:29Z)
- Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning [8.042684255871707]
This paper transforms lane rendering image anomaly detection into a classification problem.
It proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using a cross-entropy-based loss with label smoothing, and post-processing.
Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection.
arXiv Detail & Related papers (2023-12-07T16:10:10Z)
- Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer [60.31021888394358]
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR).
We propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue, i.e., adapt a source-trained model to a target domain with only unlabeled target data.
arXiv Detail & Related papers (2023-03-31T03:14:44Z)
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- Automatic Detection of Rail Components via A Deep Convolutional Transformer Network [7.557470133155959]
We propose a deep convolutional transformer network based method to detect multi-class rail components including the rail, clip, and bolt.
Our proposed method simplifies the detection pipeline by eliminating the need for prior settings such as the anchor box, aspect ratio, default coordinates, and post-processing.
Results of a comprehensive computational study show that our proposed method outperforms a set of existing state-of-the-art approaches by large margins.
arXiv Detail & Related papers (2021-08-05T07:38:04Z)
- A Secure Deep Probabilistic Dynamic Thermal Line Rating Prediction [0.0]
This paper presents a secure yet sharp probabilistic prediction model for hour-ahead forecasting of the dynamic thermal line rating (DTLR).
The security of the proposed model limits how often the DTLR prediction exceeds the actual DTLR.
By introducing a customized cost function, the deep neural network is trained to consider the DTLR security based on the required probability of exceedance.
arXiv Detail & Related papers (2020-11-21T23:20:58Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.