End-to-End Automatic Speech Recognition with Deep Mutual Learning
- URL: http://arxiv.org/abs/2102.08154v1
- Date: Tue, 16 Feb 2021 13:52:06 GMT
- Title: End-to-End Automatic Speech Recognition with Deep Mutual Learning
- Authors: Ryo Masumura, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Takanori
Ashihara
- Abstract summary: This paper is the first to apply deep mutual learning to end-to-end ASR models.
In DML, multiple models are trained simultaneously and collaboratively by mimicking each other throughout the training process.
We demonstrate that DML improves the ASR performance of both modeling setups compared with conventional learning methods.
- Score: 29.925641799136663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is the first study to apply deep mutual learning (DML) to
end-to-end ASR models. In DML, multiple models are trained simultaneously and
collaboratively by mimicking each other throughout the training process, which
helps to attain the global optimum and prevent models from making
over-confident predictions. While previous studies applied DML to simple
multi-class classification problems, there are no studies that have used it on
more complex sequence-to-sequence mapping problems. For this reason, this paper
presents a method to apply DML to state-of-the-art Transformer-based end-to-end
ASR models. In particular, we propose to combine DML with recent representative
training techniques. i.e., label smoothing, scheduled sampling, and
SpecAugment, each of which are essential for powerful end-to-end ASR models. We
expect that these training techniques work well with DML because DML has
complementary characteristics. We experimented with two setups for Japanese ASR
tasks: large-scale modeling and compact modeling. We demonstrate that DML
improves the ASR performance of both modeling setups compared with conventional
learning methods including knowledge distillation. We also show that combining
DML with the existing training techniques effectively improves ASR performance.
Related papers
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management [35.06717005729781]
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with the unified base model structure and several specialized model components.
Development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems.
We build a prototype system and evaluate it on various large MT MM models.
Experiments demonstrate the superior performance and efficiency of our system, with speedup ratio up to 71% compared to state-of-the-art training systems.
arXiv Detail & Related papers (2024-09-05T09:10:40Z) - CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT)
We demonstrate that our new approximations with semantic information are superior to representative capabilities.
arXiv Detail & Related papers (2024-02-04T04:42:05Z) - Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA)
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging)
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z) - ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for
E-Commerce Product Search [4.220439000486713]
We propose a robust multilingual model to improve the quality of search results.
In pre-training stage, we adopt mlm task, classification task and contrastive learning task.
In fine-tuning stage, we use confident learning, exponential moving average method (EMA), adversarial training (FGM) and regularized dropout strategy (R-Drop)
arXiv Detail & Related papers (2023-01-31T07:31:34Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models, that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Revisiting Training Strategies and Generalization Performance in Deep
Metric Learning [28.54755295856929]
We revisit the most widely used DML objective functions and conduct a study of the crucial parameter choices.
Under consistent comparison, DML objectives show much higher saturation than indicated by literature.
Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models.
arXiv Detail & Related papers (2020-02-19T22:16:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.