An Investigation on the Potential of KAN in Speech Enhancement
- URL: http://arxiv.org/abs/2412.17778v1
- Date: Mon, 23 Dec 2024 18:38:32 GMT
- Title: An Investigation on the Potential of KAN in Speech Enhancement
- Authors: Haoyang Li, Yuchen Hu, Chen Chen, Eng Siong Chng
- Abstract summary: Kolmogorov-Arnold Networks (KAN) are an emerging methodology that employs learnable activation functions on graph edges.
This work investigates two novel KAN variants based on rational and radial basis functions for speech enhancement.
- Score: 41.41193004912057
- Abstract: High-fidelity speech enhancement often requires sophisticated modeling to capture intricate, multiscale patterns. Standard activation functions, while introducing nonlinearity, lack the flexibility to fully address this complexity. Kolmogorov-Arnold Networks (KAN), an emerging methodology that employs learnable activation functions on graph edges, present a promising alternative. This work investigates two novel KAN variants based on rational and radial basis functions for speech enhancement. We integrate the rational variant into the 1D CNN blocks of Demucs and the GRU-Transformer blocks of MP-SENet, while the radial variant is adapted to the 2D CNN-based decoders of MP-SENet. Experiments on the VoiceBank-DEMAND dataset show that replacing standard activations with KAN-based activations improves speech quality across both time-domain and time-frequency-domain methods with minimal impact on model size and FLOPs, underscoring KAN's potential to improve speech enhancement models.
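The two activation families named in the abstract can be illustrated with a short sketch of their functional forms. This is a hedged illustration only: the coefficient values, basis sizes, and the pole-avoiding denominator trick are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def rational_activation(x, p_coeffs, q_coeffs):
    """Rational activation P(x)/Q(x) with learnable polynomial coefficients.

    The denominator is kept positive via 1 + |Q(x)| to avoid poles; this is
    a common safeguard and an assumption here, not necessarily the paper's.
    """
    p = np.polyval(p_coeffs, x)
    q = 1.0 + np.abs(np.polyval(q_coeffs, x))
    return p / q

def rbf_activation(x, centers, widths, weights):
    """Radial-basis-function activation: a learnable weighted sum of Gaussians."""
    diff = x[..., None] - centers           # (..., K)
    phi = np.exp(-(diff / widths) ** 2)     # Gaussian basis responses
    return phi @ weights                    # weighted combination -> (...,)

# Toy evaluation with made-up coefficients.
x = np.linspace(-3, 3, 7)
y_rat = rational_activation(x, p_coeffs=[1.0, 0.5, 0.0], q_coeffs=[0.3, 0.0])
y_rbf = rbf_activation(x, centers=np.linspace(-2, 2, 5),
                       widths=np.ones(5), weights=np.full(5, 0.2))
```

In a KAN-style layer these coefficients, centers, and weights would be trained per edge, in place of a fixed nonlinearity such as ReLU.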
Related papers
- Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling [39.80957479349776]
We investigate the prosody modeling capabilities of the discrete space of an RVQ-VAE model, modifying it to operate at the phoneme level.
We show that the phoneme-level discrete latent representations achieve a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable.
arXiv Detail & Related papers (2024-09-13T09:27:05Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- Dynamic Kernels and Channel Attention with Multi-Layer Embedding Aggregation for Speaker Verification [28.833851817220616]
This paper proposes an approach to increase the model resolution capability using attention-based dynamic kernels in a convolutional neural network.
The proposed dynamic convolutional model achieves 1.62% EER and 0.18 minDCF on the VoxCeleb1 test set, a 17% relative improvement over ECAPA-TDNN.
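The equal error rate (EER) quoted in this entry is the operating point where false acceptances equal false rejections. A hedged sketch of how it is typically computed from verification scores (the scores below are toy values, not VoxCeleb1 results):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Scan candidate thresholds; return the rate where FAR and FRR meet."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, best_eer = 1.0, 0.0
    for t in thresholds:
        far = np.mean(nontarget_scores >= t)  # false acceptance rate
        frr = np.mean(target_scores < t)      # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

tgt = np.array([0.9, 0.8, 0.75, 0.6, 0.55])   # same-speaker trial scores
non = np.array([0.5, 0.4, 0.45, 0.3, 0.7])    # different-speaker trial scores
eer = compute_eer(tgt, non)  # -> 0.2 for these toy scores
```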
arXiv Detail & Related papers (2022-11-03T17:13:28Z)
- A Multimodal Canonical-Correlated Graph Neural Network for Energy-Efficient Speech Enhancement [4.395837214164745]
This paper proposes a novel multimodal self-supervised architecture for energy-efficient AV speech enhancement.
It integrates graph neural networks with canonical correlation analysis (CCA-GNN).
Experiments conducted on the benchmark CHiME-3 dataset show that our proposed prior frame-based AV CCA-GNN reinforces better feature learning in the temporal context.
arXiv Detail & Related papers (2022-02-09T15:47:07Z)
- Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
We introduce on-the-fly overlapping speech simulation during training, yielding 14% relative word error rate (WER) improvement on LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
arXiv Detail & Related papers (2021-12-19T17:22:58Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
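The factorization described in this entry — a vocabulary predictor that behaves like a standalone language model (so it can be adapted on text alone) alongside a separate blank predictor — can be sketched roughly as follows. The combination rule, shapes, and function names are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def factorized_output(blank_score, vocab_lm_logits):
    """Joint distribution over [blank] + vocabulary.

    blank_score comes from a small blank predictor; vocab_lm_logits come
    from an LM-like vocabulary predictor that can be adapted on text only.
    """
    return softmax(np.concatenate([[blank_score], vocab_lm_logits]))

# Toy scores: one blank logit plus a 3-word vocabulary.
probs = factorized_output(0.5, np.array([1.0, 0.2, -0.3]))
```

The point of the split is that swapping or fine-tuning the vocabulary predictor on out-of-domain text leaves the rest of the transducer untouched.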
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Rate Distortion Characteristic Modeling for Neural Image Compression [59.25700168404325]
End-to-end optimization gives neural image compression (NIC) superior lossy compression performance.
However, distinct models must be trained to reach different points in the rate-distortion (R-D) space.
We formulate the essential mathematical functions that describe the R-D behavior of NIC using deep networks and statistical modeling.
arXiv Detail & Related papers (2021-06-24T12:23:05Z)
- An Investigation of Potential Function Designs for Neural CRF [75.79555356970344]
In this paper, we investigate a series of increasingly expressive potential functions for neural CRF models.
Our experiments show that the decomposed quadrilinear potential function based on the vector representations of two neighboring labels and two neighboring words consistently achieves the best performance.
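A decomposed quadrilinear potential of the kind this entry describes scores a clique of two neighboring labels and two neighboring words by projecting each embedding into a shared rank-R space and summing elementwise products (a CP-style decomposition). The dimensions and random embeddings below are made-up illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
R, d = 8, 16  # decomposition rank and embedding size (toy values)

# One projection matrix per factor: previous label, current label,
# previous word, current word.
U1, U2, U3, U4 = (rng.normal(size=(d, R)) for _ in range(4))

def quadrilinear_score(y_prev, y_cur, w_prev, w_cur):
    """Decomposed quadrilinear potential: sum over rank components of the
    product of the four projected representations."""
    return np.sum((y_prev @ U1) * (y_cur @ U2) * (w_prev @ U3) * (w_cur @ U4))

# Toy clique: four random d-dimensional embeddings.
y_prev, y_cur, w_prev, w_cur = rng.normal(size=(4, d))
s = quadrilinear_score(y_prev, y_cur, w_prev, w_cur)
```

The decomposition keeps the parameter count at 4dR instead of the d^4 a full quadrilinear tensor would need.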
arXiv Detail & Related papers (2020-11-11T07:32:18Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement [31.236720440495994]
In this paper, we propose an efficient E2E SE model, termed WaveCRN.
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU).
In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
arXiv Detail & Related papers (2020-04-06T13:48:05Z)
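The WaveCRN entry above describes masking applied to hidden-layer feature maps rather than to the waveform or spectrogram directly. A hedged sketch of that general idea follows; the sigmoid range restriction and the shapes are assumptions for illustration, not the paper's exact restricted feature masking design.

```python
import numpy as np

def masked_enhancement(feature_map, mask_logits):
    """Apply an elementwise mask, restricted to (0, 1) via a sigmoid,
    to hidden-layer feature maps instead of to the input signal."""
    mask = 1.0 / (1.0 + np.exp(-mask_logits))  # range restriction
    return feature_map * mask

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 16))     # toy (channels, time) feature map
logits = rng.normal(size=(4, 16))   # mask scores, e.g. from a recurrent model
enhanced = masked_enhancement(feat, logits)
```

Because the mask stays in (0, 1), each feature value can only be attenuated, which is the usual rationale for restricting the mask's range in enhancement models.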
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.