4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
- URL: http://arxiv.org/abs/2212.10818v2
- Date: Mon, 29 May 2023 23:16:56 GMT
- Title: 4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
- Authors: Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe
- Abstract summary: This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict.
The four decoders are jointly trained so that they can be easily switched depending on the application scenarios.
The experimental results showed that the proposed model consistently reduced the WER.
- Score: 29.799797974513552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The network architecture of end-to-end (E2E) automatic speech recognition
(ASR) can be classified into several models, including connectionist temporal
classification (CTC), recurrent neural network transducer (RNN-T), attention
mechanism, and non-autoregressive mask-predict models. Since each of these
network architectures has pros and cons, a typical use case is to switch between
these separate models depending on the application requirements, resulting in the
increased overhead of maintaining all models. Several methods for integrating
two of these complementary models to mitigate the overhead issue have been
proposed; however, if we integrate more models, we will further benefit from
these complementary models and realize broader applications with a single
system. This paper proposes four-decoder joint modeling (4D) of CTC, attention,
RNN-T, and mask-predict, which has the following three advantages: 1) The four
decoders are jointly trained so that they can be easily switched depending on
the application scenarios. 2) Joint training may bring model regularization and
improve the model robustness thanks to their complementary properties. 3) Novel
one-pass joint decoding methods using CTC, attention, and RNN-T further
improve the performance. The experimental results showed that the proposed
model consistently reduced the WER.
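As a rough illustration of the two key ideas, here is a minimal sketch (not the authors' implementation; all weight values are hypothetical placeholders): the four decoder losses, computed on a shared encoder output, are interpolated during joint training, and at inference the CTC, attention, and RNN-T scores of a partial hypothesis can be combined log-linearly for one-pass joint decoding.

```python
import torch

def four_d_loss(loss_ctc: torch.Tensor,
                loss_att: torch.Tensor,
                loss_rnnt: torch.Tensor,
                loss_mask: torch.Tensor,
                lambdas=(0.25, 0.25, 0.25, 0.25)) -> torch.Tensor:
    """Joint training objective: weighted sum of the four decoder losses.

    Each loss is assumed to be computed from the same shared encoder
    output; the interpolation weights are hypothetical placeholders and
    would be tuned in practice.
    """
    l_ctc, l_att, l_rnnt, l_mask = lambdas
    return (l_ctc * loss_ctc + l_att * loss_att
            + l_rnnt * loss_rnnt + l_mask * loss_mask)


def joint_decode_score(logp_ctc: float, logp_att: float, logp_rnnt: float,
                       weights=(0.3, 0.4, 0.3)) -> float:
    """One-pass joint decoding sketch: log-linear combination of the
    CTC, attention, and RNN-T log-probabilities of a partial hypothesis
    inside a single beam search (per the abstract, the joint decoding
    methods use these three decoders)."""
    w_ctc, w_att, w_rnnt = weights
    return w_ctc * logp_ctc + w_att * logp_att + w_rnnt * logp_rnnt
```

Note that the per-decoder scores are produced on different time axes (frame-synchronous for CTC and RNN-T, label-synchronous for attention), so a real beam search has to synchronize the decoders before combining their scores.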
Related papers
- 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z)
- DiTMoS: Delving into Diverse Tiny-Model Selection on Microcontrollers [34.282971510732736]
We introduce DiTMoS, a novel DNN training and inference framework with a selector-classifiers architecture.
A composition of weak models can exhibit high diversity, and their union can significantly boost the accuracy upper bound.
We deploy DiTMoS on the Nucleo STM32F767ZI board and evaluate it on three time-series datasets for human activity recognition, keyword spotting, and emotion recognition (a toy selector-classifiers sketch follows this list).
arXiv Detail & Related papers (2024-03-14T02:11:38Z)
- Systematic Architectural Design of Scale Transformed Attention Condenser DNNs via Multi-Scale Class Representational Response Similarity Analysis [93.0013343535411]
We propose a novel type of analysis called Multi-Scale Class Representational Response Similarity Analysis (ClassRepSim).
We show that adding STAC modules to ResNet style architectures can result in up to a 1.6% increase in top-1 accuracy.
Results from ClassRepSim analysis can be used to select an effective parameterization of the STAC module resulting in competitive performance.
arXiv Detail & Related papers (2023-06-16T18:29:26Z)
- 3D Convolutional with Attention for Action Recognition [6.238518976312625]
Current action recognition methods use computationally expensive models for learning the spatio-temporal dependencies of the action.
This paper proposes a deep neural network architecture for learning such dependencies, consisting of a 3D convolutional layer, fully connected layers, and an attention layer.
The method first learns spatial and temporal features of actions through the 3D-CNN, and the temporal attention mechanism then helps the model focus on essential features.
arXiv Detail & Related papers (2022-06-05T15:12:57Z)
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)
- Rate Distortion Characteristic Modeling for Neural Image Compression [59.25700168404325]
The end-to-end optimization capability gives neural image compression (NIC) superior lossy compression performance.
However, distinct models must be trained to reach different points in the rate-distortion (R-D) space.
We make efforts to formulate the essential mathematical functions that describe the R-D behavior of NIC using deep networks and statistical modeling.
arXiv Detail & Related papers (2021-06-24T12:23:05Z)
- DAIS: Automatic Channel Pruning via Differentiable Annealing Indicator Search [55.164053971213576]
Convolutional neural networks have achieved great success in computer vision tasks, despite their large computation overhead.
Structured (channel) pruning is usually applied to reduce the model redundancy while preserving the network structure.
Existing structured pruning methods rely on hand-crafted rules, which may lead to an enormous pruning space.
arXiv Detail & Related papers (2020-11-04T07:43:01Z)
- Single-Layer Graph Convolutional Networks For Recommendation [17.3621098912528]
Graph Convolutional Networks (GCNs) have received significant attention and achieved state-of-the-art performance on recommendation tasks.
Existing GCN models tend to perform recursive aggregations over all related nodes, which incurs a severe computational burden.
We propose a single GCN layer that aggregates information from the neighbors filtered by DA similarity and then generates the node representations.
arXiv Detail & Related papers (2020-06-07T14:38:47Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data (a minimal single-layer fusion sketch follows this list).
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
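For the DiTMoS entry above, here is the promised toy selector-classifiers sketch (hypothetical, not the authors' code; all dimensions are illustrative): a small selector network routes each input to one of several weak classifiers, so their union can cover more of the input space than any single tiny model.

```python
import torch
import torch.nn as nn

class SelectorClassifiers(nn.Module):
    """Toy selector-classifiers model: a selector picks one weak
    classifier per sample; dimensions are illustrative only."""

    def __init__(self, in_dim: int = 64, n_classes: int = 6, n_models: int = 4):
        super().__init__()
        self.selector = nn.Linear(in_dim, n_models)
        self.classifiers = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                          nn.Linear(32, n_classes))
            for _ in range(n_models))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        idx = self.selector(x).argmax(dim=-1)      # chosen weak model per sample
        outs = torch.stack([clf(x) for clf in self.classifiers], dim=1)
        return outs[torch.arange(x.size(0)), idx]  # route each sample to its model

# Toy usage: 8 samples of 64-dim features -> (8, 6) class logits.
logits = SelectorClassifiers()(torch.randn(8, 64))
```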
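And for the Model Fusion via Optimal Transport entry, the single-layer sketch referenced above. It uses a strong simplification: the optimal-transport alignment between neuron measures is reduced to a hard one-to-one matching (Hungarian algorithm) over the rows of two weight matrices, followed by averaging; the paper itself solves a proper OT problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_layer(w_a: np.ndarray, w_b: np.ndarray) -> np.ndarray:
    """Align model B's neurons (rows of w_b) to model A's, then average."""
    # Cost of matching each neuron of A to each neuron of B: squared
    # distance between their incoming weight vectors.
    cost = ((w_a[:, None, :] - w_b[None, :, :]) ** 2).sum(axis=-1)
    _, col = linear_sum_assignment(cost)  # optimal one-to-one matching
    return 0.5 * (w_a + w_b[col])         # permute B, then "one-shot" average

# Toy usage: fuse two random 4x3 layers.
rng = np.random.default_rng(0)
fused = fuse_layer(rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
```

In a multi-layer network the same permutation must also be applied to the columns of the next layer's weight matrix; the sketch covers only a single layer.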