HyperSound: Generating Implicit Neural Representations of Audio Signals
with Hypernetworks
- URL: http://arxiv.org/abs/2211.01839v2
- Date: Thu, 25 Jan 2024 16:49:33 GMT
- Title: HyperSound: Generating Implicit Neural Representations of Audio Signals
with Hypernetworks
- Authors: Filip Szatkowski, Karol J. Piczak, Przemysław Spurek, Jacek Tabor,
Tomasz Trzciński
- Abstract summary: Implicit neural representations (INRs) are a rapidly growing research area offering alternative ways to represent multimedia signals.
We propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time.
We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.
- Score: 23.390919506056502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Implicit neural representations (INRs) are a rapidly growing
research area offering alternative ways to represent multimedia signals.
Recent applications of INRs include image super-resolution, compression of
high-dimensional signals, and 3D rendering. However, these solutions usually
focus on visual data, and adapting them to the audio domain is not trivial.
Moreover, they require a separately trained model for every data sample. To
address this limitation, we propose HyperSound, a meta-learning method
leveraging hypernetworks to produce INRs for audio signals unseen at training
time. We show that our approach can reconstruct sound waves with quality
comparable to other state-of-the-art models.
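The core idea lends itself to a compact sketch: a hypernetwork consumes a waveform and emits the weights of a small coordinate-based network (the INR) that maps a time coordinate to an amplitude. The snippet below is a minimal illustration of that weight-generation mechanism, not the authors' implementation; the layer sizes, the sine activation, and names such as `HyperNet` and `TargetINR` are illustrative assumptions.

```python
# Minimal sketch of the hypernetwork-to-INR idea behind HyperSound.
# Not the authors' code: sizes, the sine activation, and all names
# (HyperNet, TargetINR, INR_HIDDEN) are illustrative assumptions.
import torch
import torch.nn as nn

INR_HIDDEN = 64  # hypothetical width of the generated INR

class TargetINR:
    """Functional INR t -> amplitude whose weights are supplied externally."""
    @staticmethod
    def forward(t, params):
        # t: (N, 1) time coordinates in [-1, 1]; params: generated tensors
        h = torch.sin(t @ params["w1"] + params["b1"])  # (N, INR_HIDDEN)
        return h @ params["w2"] + params["b2"]          # (N, 1)

class HyperNet(nn.Module):
    """Encodes a fixed-length waveform and emits the INR's weights."""
    def __init__(self, wave_len=16384):
        super().__init__()
        n_params = 1 * INR_HIDDEN + INR_HIDDEN + INR_HIDDEN * 1 + 1
        self.encoder = nn.Sequential(
            nn.Linear(wave_len, 256), nn.ReLU(), nn.Linear(256, n_params))

    def forward(self, wave):
        flat, i, out = self.encoder(wave), 0, {}
        for name, shape in [("w1", (1, INR_HIDDEN)), ("b1", (INR_HIDDEN,)),
                            ("w2", (INR_HIDDEN, 1)), ("b2", (1,))]:
            n = 1
            for s in shape:
                n *= s
            out[name] = flat[i:i + n].reshape(shape)  # slice out one tensor
            i += n
        return out

# Usage: one forward pass yields an INR for a waveform unseen in training.
wave = torch.randn(16384)                       # stand-in audio sample
t = torch.linspace(-1, 1, 16384).unsqueeze(1)   # time coordinates
recon = TargetINR.forward(t, HyperNet()(wave))  # (16384, 1)
loss = nn.functional.mse_loss(recon.squeeze(1), wave)
```

Training then amounts to backpropagating such a reconstruction loss into the hypernetwork's weights, so a single model serves every waveform instead of one separately trained INR per sample.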
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAVS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - High-Fidelity Audio Compression with Improved RVQGAN [49.7859037103693]
We introduce a high-fidelity universal neural audio compression algorithm that achieves 90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth.
We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio.
arXiv Detail & Related papers (2023-06-11T00:13:00Z) - Hypernetworks build Implicit Neural Representations of Sounds [18.28957270390735]
Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals across various real-life applications, including image super-resolution, image compression, or 3D rendering.
Existing methods that leverage INRs are predominantly focused on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases present in architectural attributes of image-based INR models.
We introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond samples observed in training.
Our approach reconstructs audio samples with quality comparable to other state-of-the-art models.
arXiv Detail & Related papers (2023-02-09T22:24:26Z) - Modality-Agnostic Variational Compression of Implicit Neural
Representations [96.35492043867104]
We introduce a modality-agnostic neural compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR).
Bridging the gap between latent coding and sparsity, we obtain compact latent representations non-linearly mapped to a soft gating mechanism.
After obtaining a dataset of such latent representations, we directly optimise the rate/distortion trade-off in a modality-agnostic space using neural compression.
arXiv Detail & Related papers (2023-01-23T15:22:42Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion (a toy sketch of the residual-quantization idea follows this list).
We simplify and speed up training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Towards Lightweight Controllable Audio Synthesis with Conditional
Implicit Neural Representations [10.484851004093919]
Implicit neural representations (INRs) are neural networks used to approximate low-dimensional functions.
In this work we shed light on the potential of Conditional Implicit Neural Representations (CINRs) as lightweight backbones in generative frameworks for audio synthesis.
arXiv Detail & Related papers (2021-11-14T13:36:18Z) - RAVE: A variational autoencoder for fast and high-quality neural audio
synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z) - Multi-modal Residual Perceptron Network for Audio-Video Emotion
Recognition [0.22843885788439797]
We propose a multi-modal Residual Perceptron Network (MRPN), which learns from multi-modal network branches to create deep feature representations with reduced noise.
With the proposed MRPN model and a novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate improves to 91.4%.
arXiv Detail & Related papers (2021-07-21T13:11:37Z) - AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)