Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation
- URL: http://arxiv.org/abs/2408.03588v2
- Date: Mon, 26 Aug 2024 00:52:40 GMT
- Title: Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation
- Authors: Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife
- Abstract summary: Cinematic audio source separation is a new subtask of audio source separation.
A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue (DX), music (MX), and effects (FX) stems.
We demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem.
- Score: 5.926447149127937
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Cinematic audio source separation (CASS), as a standalone problem of extracting individual stems from their mixture, is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue (DX), music (MX), and effects (FX) stems. Given the creative nature of cinematic sound production, however, several edge cases exist; some sound sources do not fit neatly in any of these three stems, necessitating the use of additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong in either the DX or MX or neither, depending heavily on the cinematic context. In this work, we demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesized that this is due to a better feature alignment at the bottleneck as enforced by the band-agnostic FiLM layer. Dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.
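A rough PyTorch sketch may help picture the hypothesized mechanism: a FiLM layer predicts one affine transform (gamma, beta) per stem query and applies it identically across all frequency bands at the bottleneck. Everything below (shapes, module names, the toy usage) is an illustrative assumption, not the authors' released implementation:

```python
# Minimal sketch of query-conditioned FiLM at a bottleneck, loosely in the
# spirit of Banquet. All shapes and names are hypothetical.
import torch
import torch.nn as nn

class FiLMBottleneck(nn.Module):
    """Modulate bottleneck features with a per-stem query embedding."""

    def __init__(self, feat_dim: int, query_dim: int):
        super().__init__()
        # Band-agnostic: one (gamma, beta) pair per feature dimension,
        # shared by every frequency band.
        self.to_gamma = nn.Linear(query_dim, feat_dim)
        self.to_beta = nn.Linear(query_dim, feat_dim)

    def forward(self, feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # feats: (batch, bands, time, feat_dim); query: (batch, query_dim)
        gamma = self.to_gamma(query)[:, None, None, :]
        beta = self.to_beta(query)[:, None, None, :]
        return gamma * feats + beta

# One shared decoder path, four queries -> four stems.
bottleneck = FiLMBottleneck(feat_dim=128, query_dim=64)
feats = torch.randn(2, 64, 100, 128)  # hypothetical encoder output
queries = {s: torch.randn(2, 64) for s in ("dialogue", "music", "vocals", "effects")}
modulated = {s: bottleneck(feats, q) for s, q in queries.items()}
```

If the abstract's hypothesis is right, forcing all stems through one modulated bottleneck like this is what keeps their features aligned, whereas dedicated per-stem decoders are free to drift apart.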
Related papers
- A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems [53.30852012059025]
Banquet is a system that allows source separation of multiple stems using just one decoder.
A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model.
arXiv Detail & Related papers (2024-06-26T20:25:53Z)
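The query pathway itself can be pictured with a heavily simplified sketch: a tagging model (an untrained stand-in below, where Banquet uses PaSST) embeds a reference clip of the target source, and that embedding steers a single shared mask estimator. Shapes and modules are assumptions for illustration, not the Banquet architecture:

```python
# Hedged sketch of a query-based separator: one mask network serves every
# stem; the query embedding selects which source to extract.
import torch
import torch.nn as nn

class QuerySeparator(nn.Module):
    def __init__(self, n_bins: int = 1025, query_dim: int = 64):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, n_bins)
        self.mask_net = nn.Sequential(nn.Linear(2 * n_bins, n_bins), nn.Sigmoid())

    def forward(self, mix_spec: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # mix_spec: (batch, time, bins); query: (batch, query_dim)
        q = self.query_proj(query)[:, None, :].expand_as(mix_spec)
        mask = self.mask_net(torch.cat([mix_spec, q], dim=-1))
        return mask * mix_spec  # masked magnitude of the queried stem

tagger = nn.Sequential(nn.Flatten(), nn.LazyLinear(64))  # stand-in for PaSST
separator = QuerySeparator()
mix = torch.randn(2, 100, 1025).abs()
reference = torch.randn(2, 1, 128, 100)  # mel patch of the target instrument
est = separator(mix, tagger(reference))
```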
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video dependence modeling scheme that combines audio and video features along their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation [39.45425155123186]
We develop a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis.
We propose a loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm (one reading is sketched below).
Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
arXiv Detail & Related papers (2023-09-05T19:19:22Z)
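Since the loss is named here only by its motivation, the following is one plausible reading rather than the paper's exact formulation: an SNR-style log-ratio in which the squared 2-norms are replaced by sparsity-promoting 1-norms.

```python
# Hedged sketch: SNR-like objective computed with 1-norms. The exact
# terms and weighting in the paper may differ.
import torch

def neg_l1_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # est, ref: (batch, time) waveforms
    signal = ref.abs().sum(dim=-1) + eps            # ||ref||_1
    residual = (est - ref).abs().sum(dim=-1) + eps  # ||est - ref||_1
    return -(10.0 * torch.log10(signal / residual)).mean()  # negate to minimize

est = torch.randn(4, 16000, requires_grad=True)
ref = torch.randn(4, 16000)
neg_l1_snr(est, ref).backward()
```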
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- High-Quality Visually-Guided Sound Separation from Diverse Categories [56.92841782969847]
DAVIS is a Diffusion-based Audio-VIsual Separation framework.
It synthesizes separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets.
arXiv Detail & Related papers (2023-07-31T19:41:49Z)
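Reduced to essentials, the sampling mechanism is ancestral DDPM sampling (Ho et al., 2020) with the mixture and a visual embedding fed to the denoiser as conditioning. The sketch below uses a dummy denoiser and a generic linear schedule; it is not the DAVIS architecture:

```python
# Generic conditional DDPM sampling loop; `denoiser` predicts the noise
# given the noisy sample, timestep, and (mixture, visual) conditioning.
import torch

@torch.no_grad()
def ddpm_sample(denoiser, mixture, visual, steps=200, shape=(1, 1, 16000)):
    betas = torch.linspace(1e-4, 0.02, steps)  # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), mixture, visual)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])  # posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-noise
    return x

dummy = lambda x, t, mix, vis: torch.zeros_like(x)  # placeholder denoiser
audio = ddpm_sample(dummy, mixture=torch.randn(1, 1, 16000), visual=torch.randn(1, 512))
```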
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
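The initialization step is simple to picture; below is a toy sketch with placeholder modules (neither the paper's architecture nor its checkpoints): pre-train an audio decoder on audio alone, then copy its weights into the video-to-speech model before fine-tuning.

```python
# Toy decoder-transplant sketch; both modules are placeholders.
import torch
import torch.nn as nn

class AudioDecoder(nn.Module):
    # Stand-in for a decoder pre-trained on large-scale 24 kHz audio.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 1024))

    def forward(self, z):
        return self.net(z)

class VideoToSpeech(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_encoder = nn.Linear(1024, 256)  # placeholder lip-video encoder
        self.decoder = AudioDecoder()

    def forward(self, video_feats):
        return self.decoder(self.video_encoder(video_feats))

pretrained = AudioDecoder()  # imagine this was trained on audio only
model = VideoToSpeech()
model.decoder.load_state_dict(pretrained.state_dict())  # initialize, then fine-tune
```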
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
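The autoregressive loop underneath such a model is generic; here is a minimal greedy-decoding sketch over discrete audio tokens with a dummy language model (AudioGen itself samples from the distribution and uses its own audio tokenizer, neither of which is reproduced here):

```python
# Greedy autoregressive decoding over discrete audio tokens, conditioned
# on a text embedding. `lm` and all shapes are illustrative placeholders.
import torch

@torch.no_grad()
def generate_tokens(lm, text_emb, max_len=8, bos=0):
    tokens = torch.tensor([[bos]])
    for _ in range(max_len):
        logits = lm(tokens, text_emb)  # (1, seq_len, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)  # append and continue
    return tokens[:, 1:]  # drop the BOS token

vocab = 1024
lm = lambda toks, emb: torch.randn(toks.shape[0], toks.shape[1], vocab)  # dummy LM
codes = generate_tokens(lm, text_emb=torch.randn(1, 512))
```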
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
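The self-supervised recipe here is the classic mix-and-separate setup: artificially sum two clips, then recover each one conditioned on its own visual context. A toy version with a placeholder separator follows (AVSGS conditions on a visual scene graph rather than the flat embedding used below):

```python
# Toy mix-and-separate self-supervision; the separator is a placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(16000 + 128, 16000))  # placeholder separator

def training_step(a: torch.Tensor, b: torch.Tensor,
                  vis_a: torch.Tensor, vis_b: torch.Tensor) -> torch.Tensor:
    mix = a + b  # artificial mixture of two independent clips
    est_a = model(torch.cat([mix, vis_a], dim=-1))  # condition on a's visuals
    est_b = model(torch.cat([mix, vis_b], dim=-1))  # condition on b's visuals
    return F.l1_loss(est_a, a) + F.l1_loss(est_b, b)

loss = training_step(torch.randn(4, 16000), torch.randn(4, 16000),
                     torch.randn(4, 128), torch.randn(4, 128))
loss.backward()
```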
- Task 1A DCASE 2021: Acoustic Scene Classification with mismatch-devices using squeeze-excitation technique and low-complexity constraint [4.4973334555746]
Acoustic scene classification (ASC) is one of the most popular problems in the field of machine listening.
The subtask presented in this report corresponds to an ASC problem that is constrained by the complexity of the model.
Specifically, a system based on two steps is proposed: a two-dimensional representation of the audio using the Gammatone filter bank and a convolutional neural network.
arXiv Detail & Related papers (2021-07-30T14:24:45Z)
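The squeeze-excitation technique the report builds on is compact enough to show in full; this is the generic block of Hu et al. (2018), not the authors' exact network:

```python
# Generic squeeze-and-excitation block: globally pool each channel,
# predict per-channel gates, and rescale the feature map.
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time), e.g. CNN features over a
        # Gammatone-spectrogram input
        w = x.mean(dim=(2, 3))            # squeeze: global average pool
        w = self.fc(w)[:, :, None, None]  # excitation: per-channel gates
        return x * w                      # recalibrate channels

se = SqueezeExcitation(channels=32)
out = se(torch.randn(8, 32, 64, 100))
```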
- A cappella: Audio-visual Singing Voice Separation [4.6453787256723365]
We explore the single-channel singing voice separation problem from a multimodal perspective.
We present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube.
We propose Y-Net, an audio-visual convolutional neural network which achieves state-of-the-art singing voice separation results.
arXiv Detail & Related papers (2021-04-20T13:17:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.