A Multi-dimensional Deep Structured State Space Approach to Speech
Enhancement Using Small-footprint Models
- URL: http://arxiv.org/abs/2306.00331v1
- Date: Thu, 1 Jun 2023 04:19:57 GMT
- Title: A Multi-dimensional Deep Structured State Space Approach to Speech
Enhancement Using Small-footprint Models
- Authors: Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee
- Abstract summary: We explore several S4-based deep architectures in time (T) and time-frequency (TF) domains.
The proposed TF-domain S4-based model is 78.6% smaller in size, yet it still achieves competitive results with a PESQ score of 3.15 with data augmentation.
- Score: 45.90759340302879
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose a multi-dimensional structured state space (S4) approach to speech
enhancement. To better capture the spectral dependencies across the frequency
axis, we focus on modifying the multi-dimensional S4 layer with whitening
transformation to build new small-footprint models that also achieve good
performance. We explore several S4-based deep architectures in time (T) and
time-frequency (TF) domains. The 2-D S4 layer can be considered a particular
convolutional layer with an infinite receptive field although it utilizes fewer
parameters than a conventional convolutional layer. Evaluated on the
VoiceBank-DEMAND data set, when compared with the conventional U-net model
based on convolutional layers, the proposed TF-domain S4-based model is 78.6%
smaller in size, yet it still achieves competitive results with a PESQ score of
3.15 with data augmentation. By increasing the model size, we can even reach a
PESQ score of 3.18.
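The abstract's claim that a 2-D S4 layer behaves like a convolution with an unbounded receptive field, using few parameters, can be illustrated with a small sketch. It is not the paper's layer: it assumes a diagonal, already-discretized state space model, builds the 2-D kernel as an outer product of two 1-D kernels, and omits the whitening transformation. The helper names `ssm_kernel` and `ssm_recurrence`, the state size, and the T x F grid are all hypothetical choices made for illustration.

```python
import numpy as np

def ssm_kernel(a, b, c, length):
    """Kernel K[l] = sum_n c[n] * a[n]**l * b[n] of a diagonal discrete SSM."""
    powers = a[None, :] ** np.arange(length)[:, None]  # shape (length, state)
    return powers @ (b * c)                            # shape (length,)

def ssm_recurrence(a, b, c, u):
    """Same SSM run step by step: x_k = a * x_{k-1} + b * u_k, y_k = c . x_k."""
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a * x + b * u_k
        y[k] = c @ x
    return y

rng = np.random.default_rng(0)
state = 4        # assumed state size per axis
T, F = 64, 32    # assumed time and frequency extents

# One small SSM per axis; |a| < 1 keeps the kernels stable.
a_t, a_f = rng.uniform(0.5, 0.95, state), rng.uniform(0.5, 0.95, state)
b_t, c_t = rng.normal(size=state), rng.normal(size=state)
b_f, c_f = rng.normal(size=state), rng.normal(size=state)

k_t = ssm_kernel(a_t, b_t, c_t, T)
k_f = ssm_kernel(a_f, b_f, c_f, F)

# Assumed 2-D construction: the kernel spans the whole T x F grid.
kernel_2d = np.outer(k_t, k_f)

# The kernel view and the recurrent view give the same causal output.
u = rng.normal(size=T)
y_conv = np.convolve(u, k_t)[:T]
y_rec = ssm_recurrence(a_t, b_t, c_t, u)

n_ssm_params = 2 * 3 * state   # a, b, c for each of the two axes
n_dense_params = T * F         # a free kernel of the same extent
```

In this toy setting the kernel covers the full 64 x 32 grid with 24 parameters, whereas a freely parameterized kernel of the same extent would need 2048. The kernel can be regenerated at any length from the same parameters, which is the sense in which the receptive field is unbounded.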
Related papers
- Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution [47.12618295041499]
We propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR.
We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget.
Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings.
arXiv Detail & Related papers (2026-02-01T15:07:59Z)
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models [79.06910348413861]
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image.
Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion.
arXiv Detail & Related papers (2025-11-01T11:16:25Z)
- State-Free Inference of State-Space Models: The Transfer Function Approach [132.83348321603205]
State-free inference does not incur any significant memory or computational cost with an increase in state size.
We achieve this using properties of the proposed frequency domain transfer function parametrization.
We report improved perplexity in language modeling over a long convolutional Hyena baseline.
arXiv Detail & Related papers (2024-05-10T00:06:02Z)
- Augmenting conformers with structured state-space sequence models for online speech recognition [41.444671189679994]
Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems.
In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4).
We performed systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions.
Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
arXiv Detail & Related papers (2023-09-15T17:14:17Z)
- Neural Networks at a Fraction with Pruned Quaternions [0.0]
Pruning is one technique to remove unnecessary weights and reduce resource requirements for training and inference.
For ML tasks where the input data is multi-dimensional, using higher-dimensional data embeddings such as complex numbers or quaternions has been shown to reduce the parameter count while maintaining accuracy.
We find that for some architectures, at very high sparsity levels, quaternion models provide higher accuracies than their real counterparts.
arXiv Detail & Related papers (2023-08-13T14:25:54Z)
- Probabilistic-based Feature Embedding of 4-D Light Fields for Compressive Imaging and Denoising [62.347491141163225]
4-D light field (LF) poses great challenges in achieving efficient and effective feature embedding.
We propose a probabilistic-based feature embedding (PFE), which learns a feature embedding architecture by assembling various low-dimensional convolution patterns.
Our experiments demonstrate the significant superiority of our methods on both real-world and synthetic 4-D LF images.
arXiv Detail & Related papers (2023-06-15T03:46:40Z)
- Liquid Structural State-Space Models [106.74783377913433]
Liquid-S4 achieves an average performance of 87.32% on the Long-Range Arena benchmark.
On the full raw Speech Command recognition dataset, Liquid-S4 achieves 96.78% accuracy with a 30% reduction in parameter counts compared to S4.
arXiv Detail & Related papers (2022-09-26T18:37:13Z)
- Simplified State Space Layers for Sequence Modeling [11.215817688691194]
Recently, models using structured state space sequence layers achieved state-of-the-art performance on many long-range tasks.
We revisit the idea that closely following the HiPPO framework is necessary for high performance.
We replace the bank of many independent single-input, single-output (SISO) SSMs the S4 layer uses with one multi-input, multi-output (MIMO) SSM.
S5 matches S4's performance on long-range tasks, including achieving an average of 82.46% on the suite of Long Range Arena benchmarks.
arXiv Detail & Related papers (2022-08-09T17:57:43Z)
- Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z)
- Diagonal State Spaces are as Effective as Structured State Spaces [3.8276199743296906]
We show that our Diagonal State Space (DSS) model matches the performance of S4 on Long Range Arena tasks and on speech classification with the Speech Commands dataset, while being conceptually simpler and straightforward to implement.
In this work, we show that one can match the performance of S4 even without the low rank correction and thus assuming the state matrices to be diagonal.
arXiv Detail & Related papers (2022-03-27T16:30:33Z)
- Real-time Ionospheric Imaging of S4 Scintillation from Limited Data with Parallel Kalman Filters and Smoothness [91.3755431537592]
We create two dimensional ionospheric images of S4 amplitude scintillation at 350 km over South America with temporal resolution of one minute.
Our results show that in areas with a network of ground receivers with a relatively good coverage the produced images can provide reliable real-time results.
arXiv Detail & Related papers (2021-05-11T23:09:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.