A Hypernetwork-Based Approach to KAN Representation of Audio Signals
- URL: http://arxiv.org/abs/2503.02585v1
- Date: Tue, 04 Mar 2025 13:08:45 GMT
- Title: A Hypernetwork-Based Approach to KAN Representation of Audio Signals
- Authors: Patryk Marszałek, Maciej Rut, Piotr Kawa, Piotr Syga
- Abstract summary: Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates.
- Score: 1.7499351967216343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
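As a rough illustration of the idea, and not the authors' implementation, the sketch below shows a minimal KAN-style INR in PyTorch: every edge between units applies its own learnable activation (here built from a Gaussian radial-basis expansion, a simplification of the B-spline bases typical KAN implementations use), and the network maps a time coordinate to a waveform sample. All sizes and the toy sine target are assumptions.

```python
# Minimal sketch of a KAN-style implicit neural representation for audio.
# Hypothetical simplification: per-edge activations use a Gaussian RBF basis
# instead of B-splines; this is NOT the paper's FewSound code.
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_basis=8):
        super().__init__()
        # Fixed basis centers on [-1, 1]; every (input, output) edge learns
        # its own mixing coefficients, i.e. its own activation function.
        self.register_buffer("centers", torch.linspace(-1.0, 1.0, num_basis))
        self.coeff = nn.Parameter(0.1 * torch.randn(in_dim, out_dim, num_basis))

    def forward(self, x):                     # x: (batch, in_dim)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) * 4.0) ** 2)
        return torch.einsum("bik,iok->bo", phi, self.coeff)

class KANAudioINR(nn.Module):
    """Maps a time coordinate in [-1, 1] to one waveform sample."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(KANLayer(1, hidden), nn.Tanh(), KANLayer(hidden, 1))

    def forward(self, t):
        return self.net(t)

# Overfitting the INR to one clip makes its weights the audio representation.
model = KANAudioINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
t = torch.linspace(-1, 1, 24000).unsqueeze(-1)   # 1.5 s at 16 kHz
target = torch.sin(2 * torch.pi * 440.0 * t)     # stand-in for real audio
for _ in range(200):
    opt.zero_grad()
    nn.functional.mse_loss(model(t), target).backward()
    opt.step()
```

Metrics such as the Log-Spectral Distance and PESQ reported in the abstract compare the INR's reconstruction against the original signal.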
Related papers
- End-to-End Implicit Neural Representations for Classification [57.55927378696826]
Implicit neural representations (INRs) encode a signal in neural network parameters and show excellent results for signal reconstruction.
INR-based classification still significantly underperforms pixel-based methods such as CNNs.
This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme (a minimal SIREN sketch follows this entry).
arXiv Detail & Related papers (2025-03-23T16:02:23Z)
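Since the entry above targets SIRENs, here is a minimal SIREN layer for context. The omega_0 = 30 scaling and uniform initialization follow the original SIREN paper; the entry's learned learning-rate scheme is not reproduced.

```python
# Minimal SIREN-style INR layer: a linear map followed by a scaled sine.
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            # First layer gets a wider init; later layers are scaled by omega_0.
            bound = 1.0 / in_dim if is_first else math.sqrt(6.0 / in_dim) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

siren = nn.Sequential(
    SineLayer(1, 64, is_first=True),   # coordinate input, e.g. time t
    SineLayer(64, 64),
    nn.Linear(64, 1),                  # signal value at t
)
```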
- Real-time Speech Enhancement on Raw Signals with Deep State-space Modeling [1.0650780147044159]
We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement.
We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets.
The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency.
arXiv Detail & Related papers (2024-09-05T09:28:56Z)
- LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement [4.891339883978289]
We propose the long short-term memory speech enhancement network (LSTMSE-Net).
This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals.
The system scales and highlights visual and audio features, then passes them through a separator network for optimized speech enhancement.
arXiv Detail & Related papers (2024-09-03T19:52:49Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality (see the sketch after this entry).
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
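The AVLIT entry above hinges on iterative blocks with shared weights. The toy sketch below shows only that pattern: one block is applied K times, so effective depth grows without adding parameters. The stand-in residual MLP is an assumption; A-FRCNN internals are not reproduced.

```python
# Iterative refinement with a single weight-shared block.
import torch
import torch.nn as nn

class IterativeRefiner(nn.Module):
    def __init__(self, dim=64, num_iters=4):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.num_iters = num_iters

    def forward(self, x):
        for _ in range(self.num_iters):   # same weights on every pass
            x = x + self.block(x)
        return x
```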
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture (the joint objective is sketched after this entry).
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in the offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
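The streaming AV-ASR entry above trains a hybrid CTC/attention model. A common form of that joint objective, shown below with standard PyTorch losses, interpolates the two terms with a weight lambda; the value 0.3 is a conventional choice, not taken from the paper.

```python
# Hybrid CTC/attention training objective: shared encoder feeds both a CTC
# head and an attention decoder; their losses are interpolated.
import torch
import torch.nn.functional as F

def hybrid_loss(log_probs_ctc, targets, input_lens, target_lens,
                decoder_logits, decoder_targets, lam=0.3):
    # log_probs_ctc: (T, batch, vocab) log-softmax outputs of the CTC head
    ctc = F.ctc_loss(log_probs_ctc, targets, input_lens, target_lens)
    # decoder_logits: (batch, L, vocab) outputs of the attention decoder
    att = F.cross_entropy(decoder_logits.transpose(1, 2), decoder_targets)
    return lam * ctc + (1.0 - lam) * att
```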
- HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks [23.390919506056502]
Implicit neural representations (INRs) are a rapidly growing research field, which provides alternative ways to represent multimedia signals.
We propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time (see the sketch after this entry).
We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.
arXiv Detail & Related papers (2022-11-03T14:20:32Z)
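Both HyperSound and the FewSound model of the main paper rest on a hypernetwork: a network whose output is the weight vector of a target INR, so unseen audio gets a representation in a single forward pass rather than per-signal optimization. The sketch below is a minimal version under assumed sizes and a mean-pooling encoder; it is not either paper's architecture.

```python
# Hypernetwork sketch: waveform -> flattened weights of a tiny 1->32->1 INR.
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    def __init__(self, hidden=128, inr_hidden=32):
        super().__init__()
        self.shapes = [(inr_hidden, 1), (inr_hidden,), (1, inr_hidden), (1,)]
        n_params = sum(torch.Size(s).numel() for s in self.shapes)
        self.encoder = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_params)

    def forward(self, wave):                  # wave: (batch, samples)
        h = self.encoder(wave.unsqueeze(-1)).mean(dim=1)   # pool over time
        return self.head(h)                   # (batch, n_params)

    def run_inr(self, flat, t):               # t: (samples, 1), one signal
        w1, b1, w2, b2 = self._split(flat[0])
        h = torch.relu(t @ w1.T + b1)
        return h @ w2.T + b2                  # predicted waveform samples

    def _split(self, flat):
        out, i = [], 0
        for s in self.shapes:
            n = torch.Size(s).numel()
            out.append(flat[i:i + n].view(s))
            i += n
        return out

hyper = HyperNet()
wave = torch.randn(1, 8000)                   # 0.5 s at 16 kHz, placeholder
t = torch.linspace(-1, 1, 8000).unsqueeze(-1)
recon = hyper.run_inr(hyper(wave), t)         # (8000, 1) reconstruction
```

FewSound, per the main abstract, builds on this scheme with improved INR parameter updates.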
- Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is the task of increasing the speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios (a two-stage sketch follows this entry).
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
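NVSR's title summarizes its recipe: predict the full-band mel spectrogram from the band-limited input, then let a neural vocoder synthesize the waveform. The stand-in networks below only sketch that two-stage flow; dimensions and the frame-to-sample mapping are assumptions, not the paper's models.

```python
# Two-stage speech super-resolution sketch: mel prediction, then vocoding.
import torch
import torch.nn as nn

n_mels = 80
mel_predictor = nn.Sequential(        # fills in missing high-frequency bands
    nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, n_mels)
)
vocoder = nn.Sequential(              # stand-in: one mel frame -> 256 samples
    nn.Linear(n_mels, 512), nn.ReLU(), nn.Linear(512, 256)
)

lowres_mel = torch.randn(100, n_mels)          # 100 frames of band-limited mel
fullband_mel = mel_predictor(lowres_mel)
waveform = vocoder(fullband_mel).reshape(-1)   # (100 * 256,) samples
```

Because every input resolution is mapped into the same mel target space, one model can serve multiple upsampling ratios.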
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition (a minimal joint-loss sketch follows this entry).
The two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
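The entry above cascades an enhancement front-end into a recognizer and optimizes both jointly. A minimal sketch of such a joint objective, with stand-in linear networks and an assumed weight alpha = 0.5:

```python
# Joint optimization of cascaded enhancement and recognition networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

enhancer = nn.Linear(16000, 16000)    # stand-in speech enhancement net
recognizer = nn.Linear(16000, 30)     # stand-in acoustic model (30 classes)

noisy = torch.randn(8, 16000)
clean = torch.randn(8, 16000)
labels = torch.randint(0, 30, (8,))

enhanced = enhancer(noisy)
loss = F.cross_entropy(recognizer(enhanced), labels) \
       + 0.5 * F.mse_loss(enhanced, clean)   # alpha = 0.5, assumed
loss.backward()                              # gradients reach both networks
```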
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN in synthesis speed on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)