Hypernetworks build Implicit Neural Representations of Sounds
- URL: http://arxiv.org/abs/2302.04959v3
- Date: Sat, 17 Jun 2023 09:41:53 GMT
- Title: Hypernetworks build Implicit Neural Representations of Sounds
- Authors: Filip Szatkowski, Karol J. Piczak, Przemysław Spurek, Jacek Tabor, Tomasz Trzciński
- Abstract summary: Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals across various real-life applications, including image super-resolution, image compression, or 3D rendering.
Existing methods that leverage INRs are predominantly focused on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases present in architectural attributes of image-based INR models.
We introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond samples observed in training.
Our approach reconstructs audio samples with quality comparable to other state-of-the-art models.
- Score: 18.28957270390735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Implicit Neural Representations (INRs) are nowadays used to represent
multimedia signals across various real-life applications, including image
super-resolution, image compression, or 3D rendering. Existing methods that
leverage INRs are predominantly focused on visual data, as their application to
other modalities, such as audio, is nontrivial due to the inductive biases
present in architectural attributes of image-based INR models. To address this
limitation, we introduce HyperSound, the first meta-learning approach to
produce INRs for audio samples that leverages hypernetworks to generalize
beyond samples observed in training. Our approach reconstructs audio samples
with quality comparable to other state-of-the-art models and provides a viable
alternative to contemporary sound representations used in deep neural networks
for audio processing, such as spectrograms.
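As a rough sketch of the core idea, a hypernetwork can emit the weights of a small coordinate network that maps a time coordinate to an amplitude. The layer sizes, encoder, and sine activations below are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the hypernetwork-to-INR idea (illustrative, not the
# paper's exact design). A hypernetwork maps a raw waveform to the weights
# of a small MLP that represents that waveform as a function t -> amplitude.
import torch
import torch.nn as nn
import torch.nn.functional as F

IN, HID = 1, 32  # INR input dim (time) and hidden width

class HyperNet(nn.Module):
    def __init__(self, wave_len=16000):
        super().__init__()
        # Shapes of the generated INR parameters: two hidden layers + output.
        self.shapes = [(HID, IN), (HID,), (HID, HID), (HID,), (1, HID), (1,)]
        n_params = sum(torch.Size(s).numel() for s in self.shapes)
        self.encoder = nn.Sequential(
            nn.Linear(wave_len, 256), nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, wave):                 # wave: (wave_len,)
        flat = self.encoder(wave)            # all INR weights, flattened
        params, i = [], 0
        for s in self.shapes:                # unflatten into weight tensors
            n = torch.Size(s).numel()
            params.append(flat[i:i + n].view(s))
            i += n
        return params

def inr(t, params):
    """Evaluate the generated INR at time coordinates t of shape (N, 1)."""
    w1, b1, w2, b2, w3, b3 = params
    h = torch.sin(F.linear(t, w1, b1))       # sine activations suit audio
    h = torch.sin(F.linear(h, w2, b2))
    return F.linear(h, w3, b3)               # predicted amplitudes, (N, 1)

# Training would minimize reconstruction error between inr(t, ...) and the
# true waveform sampled at t, updating only the hypernetwork's weights.
hyper = HyperNet()
wave = torch.randn(16000)                    # stand-in for a 1 s audio clip
t = torch.linspace(-1, 1, 16000).unsqueeze(1)
recon = inr(t, hyper(wave)).squeeze(1)
loss = F.mse_loss(recon, wave)
```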
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
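The multi-band aspect can be illustrated by splitting a waveform into frequency bands that are processed independently and recombined. The band edges and Butterworth filters below are assumptions, and the per-band diffusion model itself is left abstract.

```python
# Illustrative band splitting behind "multi-band" audio processing
# (band edges and filter design are assumptions, not the paper's).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def split_bands(wave, sr, edges=(0, 1000, 4000, 7000)):
    """Split a waveform into frequency bands with Butterworth filters."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0:
            sos = butter(4, hi, btype="lowpass", fs=sr, output="sos")
        else:
            sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        bands.append(sosfiltfilt(sos, wave))
    return bands

sr = 16000
wave = np.random.randn(sr)            # stand-in for 1 s of audio
bands = split_bands(wave, sr)
# A multi-band model would denoise/generate each band independently, then
# sum the bands back into an (approximately) full-bandwidth waveform:
recon = np.sum(bands, axis=0)
```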
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Siamese SIREN: Audio Compression with Implicit Neural Representations [10.482805367361818]
Implicit Neural Representations (INRs) have emerged as a promising method for representing diverse data modalities.
We present a preliminary investigation into the use of INRs for audio compression.
Our study introduces Siamese SIREN, a novel approach based on the popular SIREN architecture.
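For context, a SIREN represents a signal with sine activations and a scaled initialization. Below is a minimal sketch of the base SIREN building block; the Siamese variant itself is not shown.

```python
# Minimal SIREN-style layer: a linear map followed by a sine activation,
# with the initialization scheme from the SIREN paper (omega_0 = 30).
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_dim
            else:
                bound = math.sqrt(6.0 / in_dim) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# A SIREN for audio maps a time coordinate to an amplitude:
siren = nn.Sequential(SineLayer(1, 64, is_first=True),
                      SineLayer(64, 64),
                      nn.Linear(64, 1))
t = torch.linspace(-1, 1, 16000).unsqueeze(1)
amplitude = siren(t)                  # (16000, 1) reconstructed waveform
```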
arXiv Detail & Related papers (2023-06-22T15:16:06Z)
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
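The weight sharing across iterations can be sketched as a single block applied repeatedly, so the parameter count stays constant with depth. The block body below is a placeholder, not the actual A-FRCNN architecture.

```python
# Sketch of iterative refinement with shared weights: one block is applied
# N times, so parameters do not grow with the iteration count.
import torch
import torch.nn as nn

class IterativeBranch(nn.Module):
    def __init__(self, channels=64, n_iters=4):
        super().__init__()
        self.n_iters = n_iters
        self.block = nn.Sequential(      # one block, reused every iteration
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        for _ in range(self.n_iters):    # same weights at every step
            x = x + self.block(x)        # residual refinement
        return x

branch = IterativeBranch()
feats = torch.randn(1, 64, 1000)         # (batch, channels, time)
refined = branch(feats)
```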
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
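Such a coordinate transformation can be sketched as re-expressing the listener's pose in a frame centered on the sound source. The conventions below (unit view directions, distance/angle features) are assumptions for illustration.

```python
# Sketch of a source-centric coordinate transform: describe the listener's
# position and view direction relative to the sound source, so a downstream
# model conditions on source-centric geometry rather than world coordinates.
import numpy as np

def to_source_frame(listener_pos, view_dir, source_pos):
    """Return (distance to source, cosine between the view direction and
    the listener-to-source direction) as simple source-centric features."""
    offset = source_pos - listener_pos
    dist = np.linalg.norm(offset)
    to_source = offset / dist                  # unit vector listener -> source
    cos_angle = np.dot(view_dir / np.linalg.norm(view_dir), to_source)
    return dist, cos_angle

listener = np.array([1.0, 0.0, 1.5])
source = np.array([4.0, 3.0, 1.5])
view = np.array([0.0, 1.0, 0.0])
dist, cos_angle = to_source_frame(listener, view, source)
# dist and cos_angle can condition an acoustic field in place of raw world
# coordinates, making the learned field source-centric.
```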
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Modality-Agnostic Variational Compression of Implicit Neural Representations [96.35492043867104]
We introduce a modality-agnostic neural compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR).
Bridging the gap between latent coding and sparsity, we obtain compact latent representations non-linearly mapped to a soft gating mechanism.
After obtaining a dataset of such latent representations, we directly optimise the rate/distortion trade-off in a modality-agnostic space using neural compression.
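The latent-to-soft-gating idea can be sketched as a small map from a compact latent code to sigmoid gates that modulate a shared network's activations. All module names and sizes below are illustrative.

```python
# Sketch of latent-conditioned soft gating: a compact latent is mapped to
# per-unit gates in (0, 1) that modulate a shared network's hidden
# activations. Purely illustrative of the gating idea.
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    def __init__(self, in_dim, out_dim, latent_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(latent_dim, out_dim)   # latent -> gates

    def forward(self, x, z):
        g = torch.sigmoid(self.gate(z))              # soft gates in (0, 1)
        return g * torch.relu(self.linear(x))        # gate the activations

layer = GatedLayer(in_dim=2, out_dim=64, latent_dim=16)
coords = torch.randn(128, 2)                         # e.g. pixel coordinates
z = torch.randn(16)                                  # compact latent code
h = layer(coords, z)                                 # (128, 64) gated features
```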
arXiv Detail & Related papers (2023-01-23T15:22:42Z)
- HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks [23.390919506056502]
Implicit neural representations (INRs) are a rapidly growing research field, which provides alternative ways to represent multimedia signals.
We propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time.
We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.
arXiv Detail & Related papers (2022-11-03T14:20:32Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
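Lottery-ticket style pruning alternates training, magnitude pruning, and rewinding the surviving weights to their initial values. A minimal sketch follows; the training step and the audio-visual model are elided.

```python
# Minimal sketch of lottery-ticket style iterative magnitude pruning:
# train, prune the smallest surviving weights, rewind the rest to their
# initial values, and repeat. The training step itself is elided.
import copy
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
init_state = copy.deepcopy(model.state_dict())   # weights to rewind to
mask = torch.ones_like(model.weight)

for round_ in range(3):                          # pruning rounds
    # ... train (or fine-tune) the masked model here ...
    w = (model.weight * mask).abs()
    k = int(0.2 * int(mask.sum()))               # prune 20% of survivors
    # Threshold = k-th smallest magnitude among currently unpruned weights:
    thresh = w[mask.bool()].kthvalue(k).values
    mask = mask * (w > thresh).float()
    with torch.no_grad():                        # rewind, then apply mask
        model.load_state_dict(init_state)
        model.weight.mul_(mask)
```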
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition [0.22843885788439797]
We propose a multi-modal Residual Perceptron Network (MRPN) that learns from multi-modal network branches to create a deep feature representation with reduced noise.
With the proposed MRPN model and a novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate was improved to 91.4%.
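A residual fusion of modality branches can be sketched as concatenating branch features and refining them through a residual MLP. The branch encoders and dimensions below are placeholders.

```python
# Sketch of residual multimodal fusion: audio and video branch features are
# merged, and the fused representation is refined through a residual MLP.
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)      # merge the two branches
        self.refine = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))

    def forward(self, audio_feat, video_feat):
        h = self.fuse(torch.cat([audio_feat, video_feat], dim=-1))
        return h + self.refine(h)                # residual refinement

fusion = ResidualFusion()
audio_feat = torch.randn(8, 128)                 # per-clip audio features
video_feat = torch.randn(8, 128)                 # per-clip video features
emotion_repr = fusion(audio_feat, video_feat)    # (8, 128) fused features
```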
arXiv Detail & Related papers (2021-07-21T13:11:37Z)
- Adaptive Gradient Balancing for Undersampled MRI Reconstruction and Image-to-Image Translation [60.663499381212425]
We enhance the image quality by using a Wasserstein Generative Adversarial Network combined with a novel Adaptive Gradient Balancing technique.
In MRI, our method minimizes artifacts, while maintaining a high-quality reconstruction that produces sharper images than other techniques.
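The gradient-balancing idea can be sketched by rescaling the adversarial loss so its gradient norm matches the reconstruction loss's; this is a simplified illustration, not the paper's exact rule.

```python
# Simplified sketch of adaptive gradient balancing: rescale the adversarial
# loss so its gradient magnitude (w.r.t. a reference parameter) matches the
# reconstruction loss's, so neither term dominates training.
import torch

def balanced_loss(recon_loss, adv_loss, ref_param):
    g_rec = torch.autograd.grad(recon_loss, ref_param, retain_graph=True)[0]
    g_adv = torch.autograd.grad(adv_loss, ref_param, retain_graph=True)[0]
    scale = g_rec.norm() / (g_adv.norm() + 1e-8)   # match gradient norms
    return recon_loss + scale.detach() * adv_loss

# Toy usage: ref_param stands in for a generator layer's weights, and both
# losses must come from the same forward pass.
param = torch.randn(4, requires_grad=True)
recon = (param ** 2).sum()                         # stand-in recon loss
adv = (param.sum() - 1.0) ** 2                     # stand-in adversarial loss
total = balanced_loss(recon, adv, param)
total.backward()
```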
arXiv Detail & Related papers (2021-04-05T13:05:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.