3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition
- URL: http://arxiv.org/abs/2204.03178v1
- Date: Thu, 7 Apr 2022 03:10:49 GMT
- Title: 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition
- Authors: Zhao You, Shulin Feng, Dan Su, Dong Yu
- Abstract summary: We identify and integrate several approaches to achieve further improvements for ASR tasks.
Specifically, multi-loss refers to the joint CTC/AED loss, and multi-path denotes the Mixture-of-Experts (MoE) architecture.
We evaluate the proposed method on the public WenetSpeech dataset; experimental results show that it provides a 12.2%-17.6% relative CER improvement.
- Score: 31.992543274210835
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recently, the Conformer-based CTC/AED model has become a mainstream
architecture for ASR. In this paper, building on our prior work, we identify
and integrate several approaches to achieve further improvements on ASR tasks,
which we denote multi-loss, multi-path and multi-level, summarized as the "3M"
model. Specifically, multi-loss refers to the joint CTC/AED loss, and
multi-path denotes the Mixture-of-Experts (MoE) architecture, which can
effectively increase model capacity without significantly increasing
computation cost. Multi-level means that we introduce auxiliary losses at
multiple levels of a deep model to aid training. We evaluate the proposed
method on the public WenetSpeech dataset; experimental results show that it
provides a 12.2%-17.6% relative CER improvement over the baseline model trained
with the WeNet toolkit. On our large-scale corpus of 150k hours, the 3M model
also clearly outperforms the baseline Conformer model.
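
As a concrete illustration of the multi-loss component, the sketch below
interpolates a CTC loss on the encoder output with an attention-decoder
cross-entropy loss, in the spirit of the joint CTC/AED objective. This is a
minimal PyTorch sketch, not the authors' implementation: the 0.3 interpolation
weight, the blank id of 0, and the simplified target handling (no SOS/EOS
shifting) are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_ctc_aed_loss(encoder_logits, decoder_logits, targets,
                       input_lengths, target_lengths, ctc_weight=0.3):
    """Multi-loss: L = w * L_ctc + (1 - w) * L_aed (w = 0.3 is assumed)."""
    # CTC branch: F.ctc_loss expects (T, B, V) log-probabilities; padding in
    # `targets` beyond each target length is never read.
    log_probs = encoder_logits.log_softmax(dim=-1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # AED branch: token-level cross-entropy over (B, L, V) decoder outputs,
    # with padded positions masked out of the average.
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), targets,
                         reduction="none")                        # (B, L)
    mask = (torch.arange(targets.size(1), device=targets.device)[None, :]
            < target_lengths[:, None]).float()
    aed = (ce * mask).sum() / mask.sum()
    return ctc_weight * ctc + (1.0 - ctc_weight) * aed
```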
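
The multi-path component swaps a dense feed-forward block for a
Mixture-of-Experts layer, so capacity grows with the number of experts while
each token only activates a few of them. Below is a generic top-k-routed MoE
sketch; the expert count, `top_k=2`, and the routing rule are illustrative
assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Generic top-k Mixture-of-Experts feed-forward layer (a sketch)."""

    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                        # x: (B, T, D)
        gates = self.router(x).softmax(dim=-1)   # (B, T, E)
        top_w, top_i = gates.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = top_i[..., k] == e         # tokens routed to expert e
                if sel.any():
                    out[sel] += top_w[..., k][sel].unsqueeze(-1) * expert(x[sel])
        return out
```

Only the selected experts run on each token, which is how such a layer adds
parameters without a matching increase in per-token compute.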
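
The multi-level component attaches auxiliary losses at intermediate encoder
depths so gradients reach lower layers directly. The sketch below taps
intermediate layer outputs and adds auxiliary CTC terms; the tapped depths
(6 and 12) and the 0.3 auxiliary weight are illustrative assumptions, not the
paper's configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class EncoderWithAuxHeads(nn.Module):
    """Encoder stack that exposes intermediate outputs for auxiliary losses."""

    def __init__(self, layers, d_model, vocab_size, tap_layers=(6, 12)):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.tap_layers = set(tap_layers)        # depths with auxiliary heads
        self.aux_heads = nn.ModuleDict(
            {str(i): nn.Linear(d_model, vocab_size) for i in tap_layers})

    def forward(self, x):
        aux_logits = {}
        for depth, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if depth in self.tap_layers:
                aux_logits[depth] = self.aux_heads[str(depth)](x)
        return x, aux_logits

def multi_level_loss(main_loss, aux_logits, targets, input_lengths,
                     target_lengths, aux_weight=0.3):
    """Adds an averaged auxiliary CTC term for each tapped depth."""
    aux = 0.0
    for logits in aux_logits.values():           # (B, T, V) -> (T, B, V)
        log_probs = logits.transpose(0, 1).log_softmax(dim=-1)
        aux = aux + F.ctc_loss(log_probs, targets, input_lengths,
                               target_lengths, blank=0, zero_infinity=True)
    return main_loss + aux_weight * aux / max(len(aux_logits), 1)
```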
Related papers
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance on nine tasks while using significantly less data than prior state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management [35.06717005729781]
Recent foundation models can handle multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components.
Development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems.
We build a prototype system and evaluate it on various large MT MM models.
Experiments demonstrate the superior performance and efficiency of our system, with a speedup ratio of up to 71% over state-of-the-art training systems.
arXiv Detail & Related papers (2024-09-05T09:10:40Z)
- The Power of Noise: Toward a Unified Multi-modal Knowledge Graph Representation Framework [46.69058301083775]
A Multi-Modal Knowledge Graph (MMKG) representation learning framework is crucial for integrating structured knowledge into multi-modal Large Language Models (LLMs) at scale.
We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking.
Our approach achieves SOTA performance across a total of ten datasets, demonstrating its robustness and versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Mixture-of-Expert Conformer for Streaming Multilingual ASR [33.14594179710925]
We propose a truly multilingual streaming Conformer incorporating mixture-of-expert layers.
The proposed MoE layer offers efficient inference by activating a fixed number of parameters as the number of experts increases.
We evaluate the proposed model on a set of 12 languages, and achieve an average 11.9% relative improvement in WER over the baseline.
arXiv Detail & Related papers (2023-05-25T02:16:32Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between the weights to be merged and can serve as indicators of merging outcomes (a generic sketch of one such weight-distance check appears after this list).
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding Network [63.69237156340457]
We present and advocate an explainable approach toward SISR named model-guided deep unfolding network (MoG-DUN).
MoG-DUN is accurate (producing fewer aliasing artifacts), computationally efficient (with reduced model parameters), and versatile (capable of handling multiple degradations).
The superiority of the proposed MoG-DUN method over existing state-of-the-art image super-resolution methods, including RCAN, SRDNF, and SRFBN, is substantiated by extensive experiments on several popular datasets and various degradation scenarios.
arXiv Detail & Related papers (2020-09-14T08:23:37Z)
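
For the model-merging entry above, here is a generic sketch of a
weight-distance check and a simple interpolation merge over two checkpoints
with matching architectures. The two metrics proposed in that paper are not
reproduced here; both functions are illustrative assumptions.

```python
import torch

def weight_l2_distance(state_a, state_b):
    """Mean per-parameter L2 distance between two state dicts (a generic
    proxy for merging compatibility, not the paper's metric)."""
    dists = [(state_a[k].float() - state_b[k].float()).norm().item()
             for k in state_a]
    return sum(dists) / len(dists)

def interpolate_weights(state_a, state_b, alpha=0.5):
    """Simple weight-space merge: alpha * A + (1 - alpha) * B."""
    return {k: alpha * state_a[k].float() + (1 - alpha) * state_b[k].float()
            for k in state_a}
```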
This list is automatically generated from the titles and abstracts of the papers on this site.