Scaling strategies for on-device low-complexity source separation with
Conv-Tasnet
- URL: http://arxiv.org/abs/2303.03005v1
- Date: Mon, 6 Mar 2023 10:15:14 GMT
- Title: Scaling strategies for on-device low-complexity source separation with Conv-Tasnet
- Authors: Mohamed Nabih Ali, Francesco Paissan, Daniele Falavigna, Alessio Brutti
- Abstract summary: Several very effective neural approaches for single-channel speech separation have been presented in the literature.
Due to the size and complexity of these models, their use on low-resource devices, e.g., hearing aids and earphones, is still a challenge.
We consider three parameters that directly control the overall size of the model, namely the number of residual blocks, the number of repetitions of the separation block, and the number of channels in the depth-wise convolutions.
- Score: 8.40565031143262
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, several very effective neural approaches for single-channel speech
separation have been presented in the literature. However, due to the size and
complexity of these models, their use on low-resource devices, e.g., hearing
aids and earphones, remains a challenge, and no established solutions are
available yet. Although approaches based on pruning or compressing neural
models have been proposed, designing a model architecture suited to a given
application domain often requires heuristic procedures that do not port easily
to different low-resource platforms. Given the modular nature of the well-known
Conv-Tasnet speech separation architecture, in this paper we consider three
parameters that directly control the overall size of the model, namely the
number of residual blocks, the number of repetitions of the separation block,
and the number of channels in the depth-wise convolutions, and we
experimentally evaluate how they affect speech separation performance. In
particular, experiments on the Libri2Mix dataset show that the number of
dilated 1D-Conv blocks is the most critical parameter and that using extra
dilation in the residual blocks reduces the resulting performance drop.
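The three scaling knobs can be made concrete with a small, illustrative sketch (not the authors' code): it approximates the separator's weight count and receptive field as functions of X (residual blocks per repeat), R (repeats of the separation block), and H (depth-wise conv channels). The bottleneck width B and kernel size P are assumed Conv-TasNet-style defaults, and normalization/PReLU parameters are ignored.

```python
def separator_params(X: int, R: int, H: int, B: int = 128, P: int = 3) -> int:
    """Approximate weight count of R repeats of X depthwise-separable blocks."""
    per_block = (
        B * H + H      # 1x1 conv: bottleneck -> H channels (+ bias)
        + H * P + H    # depth-wise conv over H channels, kernel size P (+ bias)
        + H * B + B    # 1x1 conv back to the bottleneck width (+ bias)
    )
    return R * X * per_block


def receptive_field(X: int, R: int, P: int = 3, dil=lambda i: 2 ** i) -> int:
    """Receptive field (in frames) of R repeats of X dilated convs.

    With the standard dilation pattern 2**i, the receptive field grows
    exponentially in X; dilations add no parameters, only context.
    """
    return 1 + R * sum((P - 1) * dil(i) for i in range(X))


# Halving X shrinks the model, but also collapses the receptive field:
small_rf = receptive_field(4, 3)                          # standard dilation
# "Extra dilation" (here, a hypothetical 4**i pattern) recovers much of the
# context of the full-size model at the same parameter count:
wide_rf = receptive_field(4, 3, dil=lambda i: 4 ** i)
print(separator_params(8, 3, 512), small_rf, wide_rf)
```

This illustrates why the paper can report that reducing the number of dilated 1D-Conv blocks is the most damaging change, and why extra dilation helps: parameters scale linearly in X, R, and (roughly) H, while temporal context depends on the dilation schedule, which is free.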
Related papers
- Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression [90.59962443790593]
In this paper, we present a variable-rate image compression model based on invertible transform to overcome limitations.
Specifically, we design a lightweight multi-scale invertible neural network, which maps the input image into multi-scale latent representations.
Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods.
arXiv Detail & Related papers (2025-03-27T09:08:39Z)
- FoldGPT: Simple and Effective Large Language Model Compression Scheme [5.611544096046119]
Network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices.
We propose FoldGPT, which combines block removal and block parameter sharing.
Experiments demonstrate that FoldGPT outperforms previous state-of-the-art(SOTA) methods in efficient model compression.
arXiv Detail & Related papers (2024-07-01T03:17:53Z)
- MCNC: Manifold-Constrained Reparameterization for Neural Compression [21.70510507535041]
We present a novel model compression method, which we term Manifold-Constrained Neural Compression (MCNC).
By constraining the parameter space to our proposed manifold, we can identify high-quality solutions.
Our method significantly outperforms state-of-the-art baselines in terms of compression, accuracy, and/or model reconstruction time.
arXiv Detail & Related papers (2024-06-27T16:17:26Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences [7.592727209806414]
Several ASR models exist in various sizes, with different inference costs leading to different performance levels.
We propose to train a decision module, that would allow, given an audio sample, to use the smallest sufficient model leading to a good transcription.
By keeping the decision process computationally efficient, we build a decision module that allows substantial computational savings with reduced performance drops.
arXiv Detail & Related papers (2023-09-22T08:50:58Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation [26.94528951545861]
Speech separation models are used for isolating individual speakers in many speech processing applications.
Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks.
One such class of models known as temporal convolutional networks (TCNs) has shown promising results for speech separation tasks.
Recent research in speech dereverberation has shown that the optimal RF of a TCN varies with the reverberation characteristics of the speech signal.
arXiv Detail & Related papers (2022-10-27T10:29:19Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- Compute and memory efficient universal sound source separation [23.152611264259225]
We provide a family of efficient neural network architectures for general purpose audio source separation.
The backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF).
Our experiments show that SuDoRM-RF models perform comparably and even surpass several state-of-the-art benchmarks.
arXiv Detail & Related papers (2021-03-03T19:16:53Z)
- Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding Network [63.69237156340457]
We present and advocate an explainable approach toward SISR named model-guided deep unfolding network (MoG-DUN).
MoG-DUN is accurate (producing fewer aliasing artifacts), computationally efficient (with reduced model parameters), and versatile (capable of handling multiple degradations).
The superiority of the proposed MoG-DUN method over existing state-of-the-art image methods, including RCAN, SRDNF, and SRFBN, is substantiated by extensive experiments on several popular datasets and various degradation scenarios.
arXiv Detail & Related papers (2020-09-14T08:23:37Z)
- MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frames are the key sources for exploiting temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information provided and is not responsible for any consequences of its use.