Sequence Transduction with Graph-based Supervision
- URL: http://arxiv.org/abs/2111.01272v1
- Date: Mon, 1 Nov 2021 21:51:42 GMT
- Title: Sequence Transduction with Graph-based Supervision
- Authors: Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux
- Abstract summary: We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
- Score: 96.04967815520193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recurrent neural network transducer (RNN-T) objective plays a major role
in building today's best automatic speech recognition (ASR) systems for
production. Similarly to the connectionist temporal classification (CTC)
objective, the RNN-T loss uses specific rules that define how a set of
alignments is generated to form a lattice for the full-sum training. However,
it remains largely unknown whether these rules are optimal and lead to the best
possible ASR results. In this work, we present a new transducer objective
function that generalizes the RNN-T loss to accept a graph representation of
the labels, thus providing a flexible and efficient framework to manipulate
training lattices, for example for restricting alignments or studying different
transition rules. We demonstrate that transducer-based ASR with a CTC-like
lattice achieves better results than standard RNN-T, while also ensuring
a strictly monotonic alignment, which allows better optimization of the
decoding procedure. For example, the proposed CTC-like transducer system
achieves a word error rate of 5.9% for the test-other condition of LibriSpeech,
corresponding to an improvement of 4.8% relative to an equivalent RNN-T based
system.
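For context, the standard RNN-T objective that the proposed graph-based loss generalizes is a negative log full-sum over all alignment paths; the following is a minimal sketch in common transducer notation (the symbols are illustrative and not taken verbatim from the paper):

  % B^{-1}(y) is the set of blank-augmented alignment paths that collapse to the
  % label sequence y, i.e., the paths forming the RNN-T training lattice;
  % P(pi | x) is the path probability given by the joint network outputs.
  \mathcal{L}_{\text{RNN-T}}(x, y) = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi \mid x)

In the graph-based generalization, the set of allowed paths is defined by a supervision graph over the labels rather than by fixed RNN-T rules, so the same full-sum computation can be carried out over alternative training lattices such as a CTC-like topology.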
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z) - T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z) - CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech
Recognition [8.302549684364195]
We propose a novel model named CIF-Transducer (CIF-T) which incorporates the Continuous Integrate-and-Fire (CIF) mechanism with the RNN-T model to achieve efficient alignment.
CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models.
arXiv Detail & Related papers (2023-07-26T11:59:14Z) - Powerful and Extensible WFST Framework for RNN-Transducer Losses [71.56212119508551]
This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for RNN-Transducer (RNN-T) loss.
Existing implementations of RNN-T use CUDA-related code, which is hard to extend and debug.
We introduce two WFST-powered RNN-T implementations: "Compose-Transducer" and "Grid-Transducer".
arXiv Detail & Related papers (2023-03-18T10:36:33Z) - Accelerating RNN-T Training and Inference Using CTC guidance [18.776997761704784]
We demonstrate that the proposed method is able to accelerate the RNN-T inference by 2.2 times with similar or slightly better word error rates (WER)
arXiv Detail & Related papers (2022-10-29T03:39:18Z) - VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks.
arXiv Detail & Related papers (2022-08-03T02:45:52Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than 50 ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - HMM-Free Encoder Pre-Training for Streaming RNN Transducer [9.970995981222993]
This work describes an encoder pre-training procedure using frame-wise labels to improve the training of the streaming recurrent neural network transducer (RNN-T) model.
To the best of our knowledge, this is the first work to simulate HMM-based frame-wise labels using a CTC model for pre-training.
arXiv Detail & Related papers (2021-04-02T16:14:11Z) - Synthesizing Context-free Grammars from Recurrent Neural Networks
(Extended Version) [6.3455238301221675]
We present an algorithm for extracting context-free grammars (CFGs) from a trained recurrent neural network (RNN).
We develop a new framework, pattern rule sets (PRSs), which describe sequences of deterministic finite automata (DFAs) that approximate a non-regular language.
We show how the PRS may be converted into a CFG, enabling a familiar and useful presentation of the learned language.
arXiv Detail & Related papers (2021-01-20T16:22:25Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.