Knowing When to Quit: Probabilistic Early Exits for Speech Separation
- URL: http://arxiv.org/abs/2507.09768v2
- Date: Sun, 20 Jul 2025 18:30:26 GMT
- Title: Knowing When to Quit: Probabilistic Early Exits for Speech Separation
- Authors: Kenny Falkær Olsen, Mads Østergaard, Karl Ulbæk, Søren Føns Nielsen, Rasmus Malik Høegh Lindrup, Bjørn Sand Jensen, Morten Mørup
- Abstract summary: We propose a neural network architecture for speech separation capable of early-exit. We show that a single early-exit model can be competitive with state-of-the-art models trained at many compute and parameter budgets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget, and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use-cases we design a neural network architecture for speech separation capable of early-exit, and we propose an uncertainty-aware probabilistic framework to jointly model the clean speech signal and error variance which we use to derive probabilistic early-exit conditions in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks, and we show that a single early-exit model can be competitive with state-of-the-art models trained at many compute and parameter budgets. Our framework enables fine-grained dynamic compute-scaling of speech separation networks while achieving state-of-the-art performance and interpretable exit conditions.
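To make the exit rule concrete, below is a minimal sketch of how a variance-predicting exit head could gate computation: under a Gaussian error model, the predicted error power gives an SNR proxy, and the network stops once the requested budget is met. The module names (`blocks`, `heads`) and the specific SNR proxy are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def predicted_snr_db(s_hat: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """SNR proxy from the model's own uncertainty: under a Gaussian
    error model, predicted error power is the mean predicted variance,
    so SNR ~ signal power / predicted error power."""
    signal_power = s_hat.pow(2).mean(dim=-1)
    error_power = log_var.exp().mean(dim=-1)
    return 10.0 * torch.log10(signal_power / error_power)

def separate_with_early_exit(blocks, heads, mixture, target_snr_db=15.0):
    """Run separator blocks in sequence and stop at the first
    intermediate exit whose predicted SNR meets the requested budget."""
    h = mixture
    for block, head in zip(blocks, heads):
        h = block(h)
        s_hat, log_var = head(h)  # each exit predicts a mean and a log-variance
        if predicted_snr_db(s_hat, log_var).min() >= target_snr_db:
            return s_hat          # confident enough: exit early
    return s_hat                  # fall through to the final (deepest) exit
```

Because the exit condition is stated in decibels of desired SNR, the compute budget becomes directly interpretable to a caller, which is the appeal of the probabilistic formulation over opaque confidence thresholds.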
Related papers
- Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices [11.05223262950967]
Speech recognition software needs to be able to adjust the computational load of neural models during inference in a resource-aware manner. Early-exit architectures process the input with a subset of their layers, exiting at intermediate branches. For automatic speech recognition applications, there are memory-efficient neural architectures that apply variable frame rate analysis. We show that in this way the speech recognition performance on standard benchmarks improves significantly, at the cost of a small increase in the overall number of model parameters.
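As an illustration of the early-exit pattern such architectures share, the sketch below attaches a classifier branch to intermediate encoder layers and exits once the branch posterior is confident. The entropy threshold, layer sizes, and class count are assumptions for the example, not Splitformer's actual design.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Encoder stack with a classifier branch after every few layers."""
    def __init__(self, dim=256, n_layers=12, n_tokens=32, exit_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.exits = nn.ModuleDict({
            str(i): nn.Linear(dim, n_tokens)
            for i in range(exit_every - 1, n_layers, exit_every)})

    def forward(self, x, max_entropy=0.5):
        logits, depth = None, 0
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if str(i) in self.exits:
                logits, depth = self.exits[str(i)](x), i + 1
                probs = logits.softmax(dim=-1)
                entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
                if entropy < max_entropy:  # confident: stop processing layers
                    break
        return logits, depth  # frame-level logits and the depth actually used
```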
arXiv Detail & Related papers (2025-06-22T13:34:18Z)
- Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application [55.42071552739813]
We propose a novel semantic communication framework empowered by generative artificial intelligence (GAI): a latent diffusion model (LDM)-based design that uses a variational autoencoder for semantic feature extraction. The proposed system is training-free, supports zero-shot generalization, and achieves superior performance under low-SNR and out-of-distribution conditions.
arXiv Detail & Related papers (2025-06-06T03:20:32Z)
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses the constraints of real-time visual inference by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Unsupervised Composable Representations for Audio [0.9888599167642799]
Current generative models are able to generate high-quality artefacts but have been shown to struggle with compositional reasoning.
In this paper, we focus on the problem of compositional representation learning for music data, specifically targeting the fully-unsupervised setting.
We propose a framework that leverages an explicit compositional inductive bias, defined by a flexible auto-encoding objective.
arXiv Detail & Related papers (2024-08-19T08:41:09Z)
- Discrete Neural Algorithmic Reasoning [18.497863598167257]
We propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states.
Trained with supervision on the algorithm's state transitions, such models are able to perfectly align with the original algorithm.
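A common way to realize such a hard-state bottleneck is an argmax with a straight-through gradient. The sketch below shows this general trick, not necessarily the paper's exact scheme.

```python
import torch

def discretize_states(logits: torch.Tensor) -> torch.Tensor:
    """Snap soft state predictions onto one of k predefined states.
    The forward pass uses the hard one-hot choice; the backward pass
    reuses the soft distribution's gradient (straight-through)."""
    soft = logits.softmax(dim=-1)
    hard = torch.zeros_like(soft).scatter_(
        -1, soft.argmax(dim=-1, keepdim=True), 1.0)
    return hard + (soft - soft.detach())  # one-hot forward, soft gradient
```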
arXiv Detail & Related papers (2024-02-18T16:03:04Z)
- Training dynamic models using early exits for automatic speech recognition on resource-constrained devices [15.879328412777008]
Early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands.
We show that early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also exhibit enhanced task accuracy compared to single-exit or pre-trained models.
Results provide insights into the training dynamics of early-exit architectures for ASR models.
arXiv Detail & Related papers (2023-09-18T07:45:16Z)
- Exploiting Temporal Structures of Cyclostationary Signals for Data-Driven Single-Channel Source Separation [98.95383921866096]
We study the problem of single-channel source separation (SCSS).
We focus on cyclostationary signals, which are particularly suitable in a variety of application domains.
We propose a deep learning approach using a U-Net architecture, which is competitive with the minimum MSE estimator.
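For reference, a toy 1-D U-Net with a single down/up level and a skip connection captures the architectural shape. Channel counts and kernel sizes below are arbitrary placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyUNet1d(nn.Module):
    """Minimal 1-D U-Net: encode, downsample, upsample, decode,
    with the encoder features concatenated back in (skip connection).
    Assumes an even-length input waveform of shape (batch, 1, time)."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Conv1d(1, ch, kernel_size=9, padding=4)
        self.down = nn.Conv1d(ch, 2 * ch, kernel_size=9, stride=2, padding=4)
        self.up = nn.ConvTranspose1d(2 * ch, ch, kernel_size=9, stride=2,
                                     padding=4, output_padding=1)
        self.dec = nn.Conv1d(2 * ch, 1, kernel_size=9, padding=4)  # skip concat doubles channels

    def forward(self, x):
        e = torch.relu(self.enc(x))
        b = torch.relu(self.down(e))
        u = torch.relu(self.up(b))
        return self.dec(torch.cat([e, u], dim=1))

# Usage: TinyUNet1d()(torch.randn(1, 1, 16000)) returns an estimate of one source.
```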
arXiv Detail & Related papers (2022-08-22T14:04:56Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech signal can be re-synthesized by feeding the predicted symbols to the synthesis model.
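The pipeline can be summarized in a few lines; `recognizer`, `symbol_decoder`, and `vocoder` below are hypothetical stand-ins for the paper's recognition and synthesis models, not their actual interfaces.

```python
def separate_by_resynthesis(mixture, recognizer, symbol_decoder, vocoder):
    """Sketch of the discretize-and-resynthesize pipeline:
    1. recognizer: mixture -> per-speaker logits over discrete symbols
    2. argmax: logits -> discrete symbol sequence (the bottleneck)
    3. symbol_decoder + vocoder: symbols -> re-synthesized waveform
    """
    separated = []
    for logits in recognizer(mixture):       # one (T, n_symbols) tensor per speaker
        symbols = logits.argmax(dim=-1)      # discrete symbol sequence
        features = symbol_decoder(symbols)   # symbols -> acoustic features
        separated.append(vocoder(features))  # features -> target waveform
    return separated
```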
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated seamlessly with neural networks.
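To illustrate the idea of randomizing a DP recursion, this sketch estimates each sum in an HMM forward pass from a uniformly sampled subset of states, rescaled so the estimate is unbiased in expectation. It is a deliberate simplification, not the paper's estimator, and a real implementation would work in log-space for numerical stability.

```python
import numpy as np

def randomized_forward(log_pi, log_A, log_obs, k=64, seed=0):
    """HMM forward pass where each sum over N predecessor states is
    estimated from k uniformly sampled states, rescaled by N/k to be
    unbiased in expectation. log_pi: (N,), log_A: (N, N),
    log_obs: (T, N). Returns an estimated log-likelihood."""
    rng = np.random.default_rng(seed)
    N, T = log_A.shape[0], log_obs.shape[0]
    alpha = np.exp(log_pi + log_obs[0])               # (N,)
    for t in range(1, T):
        idx = rng.choice(N, size=min(k, N), replace=False)
        # estimate sum_j alpha[j] * A[j, i] from the sampled subset
        est = (N / len(idx)) * (alpha[idx] @ np.exp(log_A[idx, :]))
        alpha = est * np.exp(log_obs[t])
    return float(np.log(alpha.sum()))
```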
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- CDLNet: Robust and Interpretable Denoising Through Deep Convolutional Dictionary Learning [6.6234935958112295]
Unrolled optimization networks offer an interpretable alternative to conventionally constructed deep neural networks.
We show that the proposed model outperforms the state-of-the-art denoising models when scaled to similar parameter count.
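The unrolling idea, in its simplest dense (non-convolutional) form, looks like the following: each network layer is one ISTA iteration of sparse coding with a learned threshold. Dimensions and hyperparameters are placeholders, and the dense dictionary is a simplification of CDLNet's convolutional one.

```python
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    """Each 'layer' is one ISTA step for min_z 0.5||z D^T - x||^2 + l1,
    with the dictionary, step size, and per-iteration thresholds learned."""
    def __init__(self, signal_dim=64, code_dim=128, n_iters=10):
        super().__init__()
        self.D = nn.Parameter(torch.randn(signal_dim, code_dim) * 0.1)
        self.step = nn.Parameter(torch.tensor(0.5))
        self.thresh = nn.Parameter(torch.full((n_iters,), 0.05))
        self.n_iters = n_iters

    def forward(self, x):                         # x: (batch, signal_dim)
        z = x.new_zeros(x.shape[0], self.D.shape[1])
        for k in range(self.n_iters):
            grad = (z @ self.D.t() - x) @ self.D  # gradient of the data term
            z = z - self.step * grad
            z = torch.sign(z) * torch.relu(z.abs() - self.thresh[k])  # soft-threshold
        return z @ self.D.t()                     # reconstruction (denoised signal)
```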
arXiv Detail & Related papers (2021-03-05T01:15:59Z)
- Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model against a previously proposed model based on an ensemble of simpler neural networks that detect firearms via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)
- Dataless Model Selection with the Deep Frame Potential [45.16941644841897]
We quantify networks by their intrinsic capacity for unique and robust representations.
We propose the deep frame potential: a measure of coherence that is approximately related to representation stability but has minimizers that depend only on network structure.
We validate its use as a criterion for model selection and demonstrate correlation with generalization error on a variety of common residual and densely connected network architectures.
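A simplified, single-layer version of the underlying quantity is easy to compute: the frame potential of a weight matrix is the summed squared correlation between its normalized columns, requiring no data at all. The paper's deep frame potential couples all layers of a network; this per-layer version is only illustrative.

```python
import torch

def frame_potential(W: torch.Tensor) -> torch.Tensor:
    """Frame potential ||Wn^T Wn||_F^2 of the unit-normalized columns
    of W: lower coherence suggests less redundant, more robust
    representations, computed from the weights alone."""
    Wn = W / W.norm(dim=0, keepdim=True).clamp_min(1e-12)  # unit-norm columns
    gram = Wn.t() @ Wn
    return gram.pow(2).sum()

# Example: compare layers of different widths on a normalized scale.
torch.manual_seed(0)
a, b = torch.randn(64, 128), torch.randn(64, 256)
print(frame_potential(a) / 128**2, frame_potential(b) / 256**2)
```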
arXiv Detail & Related papers (2020-03-30T23:27:25Z)
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
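For intuition, one left-to-right max-product sweep on a chain of labels, the kind of update a BP-Layer truncates and makes differentiable, can be written compactly. This is plain max-product (max-sum in the log domain) on a chain, not the paper's exact layer; full max-marginals would also require a right-to-left sweep.

```python
import torch

def max_product_sweep(unary, pairwise):
    """One left-to-right max-product sweep over a chain.
    unary: (T, L) per-position log-scores; pairwise: (L, L) transition
    log-scores. Returns each position's score including all messages
    accumulated from the left."""
    T, L = unary.shape
    msg = torch.zeros(L)
    beliefs = []
    for t in range(T):
        belief = unary[t] + msg
        beliefs.append(belief)
        # message to the next position: maximize over the current label
        msg = (belief.unsqueeze(1) + pairwise).max(dim=0).values
    return torch.stack(beliefs)
```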
arXiv Detail & Related papers (2020-03-13T13:11:35Z)