Dynamic Acoustic Model Architecture Optimization in Training for ASR
- URL: http://arxiv.org/abs/2506.13180v2
- Date: Wed, 18 Jun 2025 08:02:29 GMT
- Title: Dynamic Acoustic Model Architecture Optimization in Training for ASR
- Authors: Jingjing Xu, Zijian Yang, Albert Zeyer, Eugen Beck, Ralf Schlueter, Hermann Ney
- Abstract summary: DMAO is an architecture optimization framework that employs a grow-and-drop strategy to automatically reallocate parameters during training. We evaluate DMAO through experiments with CTC on LibriSpeech, TED-LIUM-v2 and Switchboard datasets.
- Score: 51.21112094223223
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Architecture design is inherently complex. Existing approaches rely on either handcrafted rules, which demand extensive empirical expertise, or automated methods like neural architecture search, which are computationally intensive. In this paper, we introduce DMAO, an architecture optimization framework that employs a grow-and-drop strategy to automatically reallocate parameters during training. This reallocation shifts resources from less-utilized areas to those parts of the model where they are most beneficial. Notably, DMAO only introduces negligible training overhead at a given model complexity. We evaluate DMAO through experiments with CTC on LibriSpeech, TED-LIUM-v2 and Switchboard datasets. The results show that, using the same amount of training resources, our proposed DMAO consistently improves WER by up to 6% relatively across various architectures, model sizes, and datasets. Furthermore, we analyze the pattern of parameter redistribution and uncover insightful findings.
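As a rough, self-contained illustration of the grow-and-drop idea (not DMAO's actual algorithm), the Python sketch below reallocates a fixed parameter budget between modules according to a utility score; the gradient-based heuristic, the per-block granularity, and the reallocation fraction are assumptions made for this example.

```python
# Minimal sketch of a grow-and-drop parameter reallocation step (illustrative only;
# the utility heuristic, module granularity, and schedule are assumptions, not the
# criteria used by DMAO).

def utility(stats):
    """Hypothetical per-module utility: average absolute gradient per parameter."""
    return stats["grad_abs_sum"] / max(stats["num_params"], 1)

def grow_and_drop(budgets, stats, frac=0.1):
    """Move a fraction of the donor module's budget to the highest-utility module.

    budgets: dict module_name -> parameter budget (int)
    stats:   dict module_name -> {"grad_abs_sum": float, "num_params": int}
    """
    scores = {name: utility(stats[name]) for name in budgets}
    donor = min(scores, key=scores.get)       # least-utilized module: drop here
    recipient = max(scores, key=scores.get)   # most-utilized module: grow here
    moved = int(frac * budgets[donor])
    budgets[donor] -= moved
    budgets[recipient] += moved
    return budgets

# Example: four encoder blocks with equal initial budgets. The total parameter
# count is unchanged after reallocation, mirroring the claim that overall model
# complexity stays fixed.
budgets = {f"block{k}": 1_000_000 for k in range(1, 5)}
stats = {
    "block1": {"grad_abs_sum": 2.0e3, "num_params": 1_000_000},
    "block2": {"grad_abs_sum": 9.5e3, "num_params": 1_000_000},
    "block3": {"grad_abs_sum": 4.1e3, "num_params": 1_000_000},
    "block4": {"grad_abs_sum": 1.2e3, "num_params": 1_000_000},
}
print(grow_and_drop(budgets, stats))
```

Calling such a step at a few points during training keeps the total budget fixed while shifting capacity toward the modules that appear to use it most, which is the behaviour the abstract describes.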
Related papers
- Meta knowledge assisted Evolutionary Neural Architecture Search [38.55611683982936]
This paper introduces an efficient EC-based NAS method to solve problems via an innovative meta-learning framework. An adaptive surrogate model is designed through an adaptive threshold to select the potential architectures. Experiments on CIFAR-10, CIFAR-100, and ImageNet1K datasets demonstrate that the proposed method achieves high performance comparable to that of many state-of-the-art peer methods.
arXiv Detail & Related papers (2025-04-30T11:43:07Z) - ZeroLM: Data-Free Transformer Architecture Search for Language Models [54.83882149157548]
Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics. Our evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark.
arXiv Detail & Related papers (2025-03-24T13:11:22Z) - A Theoretical Framework for Data Efficient Multi-Source Transfer Learning Based on Cramér-Rao Bound [16.49737340580437]
We propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure that aligns with cross-entropy loss, and minimize it based on the Cramér-Rao Bound to determine the optimal transfer quantity for each source task. We develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for training deep multi-source transfer learning models.
arXiv Detail & Related papers (2025-02-06T17:32:49Z) - Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives.
We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis.
We find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
arXiv Detail & Related papers (2024-03-26T16:33:12Z) - Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones.
This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Grassroots Operator Search for Model Edge Adaptation [2.1756721838833797]
Hardware-aware Neural Architecture Search (HW-NAS) is increasingly being used to design efficient deep learning architectures.
We present a Grassroots Operator Search (GOS) methodology to search for efficient operator replacement.
Our method consistently outperforms the original models on two edge devices, with a minimum of 2.2x speedup while maintaining high accuracy.
arXiv Detail & Related papers (2023-09-20T12:15:58Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Build a Robust QA System with Transformer-based Mixture of Experts [0.29005223064604074]
We build a robust question answering system that can adapt to out-of-domain datasets.
We show that our combination of best architecture and data augmentation techniques achieves a 53.477 F1 score in the out-of-domain evaluation.
arXiv Detail & Related papers (2022-03-20T02:38:29Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
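The two MPO-based entries above (parameter-efficient scaling and the MoE variant) both rely on factorizing a weight matrix into a product of local tensors. The toy NumPy sketch below shows one way such a matrix-product-operator style factorization can be obtained via a truncated SVD after regrouping indices; the two-core split, the bond dimension, and treating the larger core as the shared "central tensor" are illustrative assumptions, not the cited papers' exact construction.

```python
import numpy as np

# Toy MPO-style (tensor-train) factorization of a weight matrix via truncated SVD.
# W stands in for a layer's weight matrix.

rng = np.random.default_rng(0)
I, J = 64, 64          # weight matrix shape
i1, i2 = 4, 16         # factorization of the row dimension    (I = i1 * i2)
j1, j2 = 4, 16         # factorization of the column dimension (J = j1 * j2)
rank = 8               # bond dimension (truncation rank)

W = rng.standard_normal((I, J))

# Regroup indices so the matrix splits into an (i1, j1) part and an (i2, j2) part.
T = W.reshape(i1, i2, j1, j2).transpose(0, 2, 1, 3).reshape(i1 * j1, i2 * j2)

# Truncated SVD yields two local tensors ("cores").
U, S, Vt = np.linalg.svd(T, full_matrices=False)
aux_core = U[:, :rank].reshape(i1, j1, rank)                       # small auxiliary tensor
central_core = (S[:rank, None] * Vt[:rank]).reshape(rank, i2, j2)  # larger "central" tensor

# Reconstruct an approximation of W from the two cores.
W_hat = np.einsum('abr,rcd->abcd', aux_core, central_core)         # (i1, j1, i2, j2)
W_hat = W_hat.transpose(0, 2, 1, 3).reshape(I, J)

# A random matrix is nearly full rank, so the truncation error here is sizeable;
# the point is only to show the mechanics and the parameter counting.
print(f"relative error: {np.linalg.norm(W - W_hat) / np.linalg.norm(W):.3f}")
print(f"parameters: full = {W.size}, MPO cores = {aux_core.size + central_core.size}")
```

Sharing the central core across layers, as the first MPO entry describes, would then leave only the small auxiliary tensors as per-layer parameters.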
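For context on the Cramér-Rao-based transfer-learning entry above: the classical Cramér-Rao bound states that any unbiased estimator of a scalar parameter has variance at least the inverse Fisher information. How that paper couples the bound with its cross-entropy-aligned generalization error measure to select per-source sample quantities is specific to that work and not reproduced here.

```latex
\operatorname{Var}\!\left(\hat{\theta}\right) \;\ge\; \mathcal{I}(\theta)^{-1},
\qquad
\mathcal{I}(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\!\left[\left(\tfrac{\partial}{\partial\theta}\log p(x;\theta)\right)^{\!2}\right].
```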
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.