Related papers: Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks

URL: http://arxiv.org/abs/2601.16117v2
Date: Tue, 27 Jan 2026 11:20:19 GMT
Title: Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks
Authors: Abdul Hannan, Daniele Falavigna, Shah Nawaz, Mubashir Noman, Markus Schedl, Alessio Brutti
Abstract summary: The layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones. We propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion. Our framework reduces the word error rate by $9.32\%$ and $2.25\%$ for high and no dropping cases with a $33.3\%$ reduction in training time.
Score: 20.54366796766549
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to limitations of the available resources. To meet such demands, layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network along with reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly impact the dynamic model's performance for low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for high and no dropping cases with $33.3\%$ reduction in training time.
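
The abstract does not spell out the mechanics of DLD, so the following is only a generic sketch of the two ingredients it names: layer dropping (skipping layers of a residual stack) and an output-matching distillation loss between the full network and the dropped one. Everything here is an illustrative assumption rather than the authors' method: the scalar "layers", the function names `forward`, `sample_keep_mask`, and `distillation_loss`, the uniform drop probability, and the squared-error loss are all hypothetical.

```python
import random


def forward(x, layers, keep_mask):
    """Run x through a residual stack, skipping layers whose mask entry is False."""
    for layer, keep in zip(layers, keep_mask):
        if keep:
            # Residual form: a skipped layer behaves like the identity.
            x = x + layer(x)
    return x


def sample_keep_mask(num_layers, drop_prob, rng):
    """Keep each layer independently with probability 1 - drop_prob (assumed policy)."""
    return [rng.random() >= drop_prob for _ in range(num_layers)]


def distillation_loss(student_out, teacher_out):
    """Squared error between dropped (student) and full (teacher) outputs (assumed loss)."""
    return (student_out - teacher_out) ** 2


# Toy scalar "layers"; a real speech model would use tensors and encoder blocks.
layers = [lambda x, w=w: w * x for w in (0.1, 0.2, 0.3)]

rng = random.Random(0)
x = 1.0
teacher_out = forward(x, layers, [True] * len(layers))  # no dropping
mask = sample_keep_mask(len(layers), drop_prob=0.5, rng=rng)
student_out = forward(x, layers, mask)  # some layers skipped
loss = distillation_loss(student_out, teacher_out)
```

In this toy setup the loss is zero when no layer is dropped and grows as the dropped path diverges from the full one, which is the signal a distillation term would use to pull the two paths together during training.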

Related papers

Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models [17.818685759025207]
Layer-wise pruning is a commonly employed strategy to mitigate inference costs. This paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game. It achieves more efficient and effective layer-wise pruning for large language models.
arXiv Detail & Related papers (2026-02-08T03:51:36Z)
A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs [7.577235739757108]
Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations.
arXiv Detail & Related papers (2025-11-21T10:55:44Z)
The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models [33.90597962418094]
We propose CLP, a novel continuous layer pruning framework for large language models. CLP uses a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning. CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
arXiv Detail & Related papers (2025-10-25T16:40:17Z)
DyMoDreamer: World Modeling with Dynamic Modulation [52.27044216359359]
A critical bottleneck in deep reinforcement learning (DRL) is sample inefficiency, as training high-performance agents often demands extensive environmental interactions. We introduce DyMoDreamer, a novel algorithm that incorporates a dynamic modulation mechanism to improve the extraction of dynamic features and enrich the temporal information. Experiments demonstrate that DyMoDreamer sets a new state-of-the-art on the Atari $100$k benchmark with a $156.6\%$ mean human-normalized score.
arXiv Detail & Related papers (2025-09-29T13:54:42Z)
Deep Hierarchical Learning with Nested Subspace Networks [53.71337604556311]
We propose Nested Subspace Networks (NSNs) for large neural networks. NSNs enable a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets. We show that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier.
arXiv Detail & Related papers (2025-09-22T15:13:14Z)
Input Conditioned Layer Dropping in Speech Foundation Models [11.05223262950967]
Layer dropping ($\mathcal{LD}$) skips a fraction of the layers of a backbone network during inference to reduce the computational load. We propose input-driven $\mathcal{LD}$ that employs the network's input features and a lightweight layer selecting network to determine the optimum combination of processing layers.
arXiv Detail & Related papers (2025-07-10T17:39:03Z)
Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models [50.260693393896716]
Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but constrained by high computational costs. We propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.
arXiv Detail & Related papers (2025-06-03T06:02:50Z)
Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models [31.103832542711864]
Balcony is a framework for depth-based dynamic inference. It maintains the full model's performance while enabling real-time adaptation to different computational budgets. Remarkably, we show that Balcony outperforms state-of-the-art methods such as Flextron and Layerskip.
arXiv Detail & Related papers (2025-03-06T22:09:55Z)
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [86.76714527437383]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains against decoding with the target model only, while achieving significantly better accuracy than the parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
DyCE: Dynamically Configurable Exiting for Deep Learning Compression and Real-time Scaling [1.8350044465969415]
DyCE can adjust the performance-complexity trade-off of a deep learning model at runtime without requiring re-initialization or redeployment on inference hardware. DyCE significantly reduces computational complexity by 23.5% for ResNet152 and 25.9% for ConvNextv2-tiny on ImageNet, with accuracy reductions of less than 0.5%.
arXiv Detail & Related papers (2024-03-04T03:09:28Z)
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities. LVLMs are often problematic due to their massive computational/energy costs and carbon consumption. We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.