A Generalized Bandsplit Neural Network for Cinematic Audio Source
Separation
- URL: http://arxiv.org/abs/2309.02539v3
- Date: Fri, 1 Dec 2023 22:43:25 GMT
- Title: A Generalized Bandsplit Neural Network for Cinematic Audio Source
Separation
- Authors: Karn N. Watcharasupat, Chih-Wei Wu, Yiwei Ding, Iroro Orife, Aaron J.
Hipple, Phillip A. Williams, Scott Kramer, Alexander Lerch, and William
Wolcott
- Abstract summary: We develop a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis.
A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm is proposed.
Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
- Score: 39.45425155123186
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Cinematic audio source separation is a relatively new subtask of audio source
separation, with the aim of extracting the dialogue, music, and effects stems
from their mixture. In this work, we developed a model generalizing the
Bandsplit RNN for any complete or overcomplete partitions of the frequency
axis. Psychoacoustically motivated frequency scales were used to inform the
band definitions which are now defined with redundancy for more reliable
feature extraction. A loss function motivated by the signal-to-noise ratio and
the sparsity-promoting property of the 1-norm was proposed. We additionally
exploit the information-sharing property of a common-encoder setup to reduce
computational complexity during both training and inference, improve separation
performance for hard-to-generalize classes of sounds, and allow flexibility
during inference time with detachable decoders. Our best model sets the state
of the art on the Divide and Remaster dataset with performance above the ideal
ratio mask for the dialogue stem.
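The abstract mentions two concrete ingredients: overcomplete band definitions with redundancy on a psychoacoustic frequency scale, and a loss motivated by the signal-to-noise ratio together with the sparsity-promoting 1-norm. The sketch below is a minimal illustration of how such pieces might be written; the mel-scale band construction, the overlap factor, the mixing weight `alpha`, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math
import torch


def overlapping_mel_bands(n_freq_bins, n_bands, sample_rate=44100, overlap=0.5):
    """Illustrative overcomplete partition of the frequency axis: bands spaced
    evenly on the mel scale, then widened so neighbouring bands overlap.
    (Assumed construction, not the paper's exact band definitions.)"""
    hz_to_mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    bands = []
    for i in range(n_bands):
        lo_mel = max_mel * i / n_bands
        hi_mel = max_mel * (i + 1) / n_bands
        width = hi_mel - lo_mel
        # widen each band by `overlap` of its width on both sides -> redundancy
        lo = mel_to_hz(max(lo_mel - overlap * width, 0.0))
        hi = mel_to_hz(min(hi_mel + overlap * width, max_mel))
        lo_bin = int(round(lo / nyquist * (n_freq_bins - 1)))
        hi_bin = int(round(hi / nyquist * (n_freq_bins - 1))) + 1
        bands.append((lo_bin, hi_bin))
    return bands  # overlapping index ranges into the frequency axis


def snr_l1_loss(est, ref, alpha=0.5, eps=1e-8):
    """Illustrative loss combining a negative SNR term with an L1 penalty on
    the residual; `alpha` is a hypothetical mixing weight."""
    err = est - ref
    snr_db = 10.0 * torch.log10(ref.pow(2).sum(-1) / (err.pow(2).sum(-1) + eps) + eps)
    l1 = err.abs().mean(-1)
    return (-snr_db + alpha * l1).mean()
```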
Related papers
- Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation [0.0]
This study tackles the separation of vocal components from musical spectrograms.
We employ the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed frequency-time spectrograms.
We implement a U-Net neural network to segment the spectrogram image, aiming to delineate and extract singing voice components accurately.
arXiv Detail & Related papers (2024-05-30T13:47:53Z) - RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Audio-Visual Speech Separation in Noisy Environments with a Lightweight
Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
arXiv Detail & Related papers (2023-05-31T20:09:50Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations by more than half for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual
Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z) - Sudo rm -rf: Efficient Networks for Universal Audio Source Separation [32.851407723043806]
We present an efficient neural network for end-to-end general purpose audio source separation.
The backbone structure of this network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRMRF).
arXiv Detail & Related papers (2020-07-14T05:46:38Z)