DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models
- URL: http://arxiv.org/abs/2308.00122v1
- Date: Mon, 31 Jul 2023 19:41:49 GMT
- Title: DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models
- Authors: Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
- Abstract summary: DAVIS is a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset.
Results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
- Score: 49.62299756133055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and the results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
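The abstract pins down the core mechanism well enough to illustrate: a Separation U-Net iteratively denoises a Gaussian-initialized magnitude spectrogram, conditioned on the audio mixture and on visual features. Below is a minimal PyTorch sketch of such a conditional reverse-diffusion loop; the `SeparationUNet` stand-in, layer sizes, noise schedule, and feature dimensions are hypothetical placeholders for illustration, not the paper's implementation.

```python
# Minimal sketch of a DAVIS-style conditional diffusion sampler.
# All names, sizes, and the noise schedule are illustrative assumptions.
import torch
import torch.nn as nn

class SeparationUNet(nn.Module):
    """Stand-in for the paper's Separation U-Net: predicts the noise in the
    current separated-magnitude estimate, conditioned on the mixture
    magnitude and a visual feature vector (timestep embedding omitted)."""
    def __init__(self, freq_bins=256, vis_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, freq_bins)
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x_t, mixture, vis_feat, t):
        # Broadcast the visual embedding over time as an extra input channel.
        b, _, f, frames = x_t.shape
        vis = self.vis_proj(vis_feat).view(b, 1, f, 1).expand(b, 1, f, frames)
        return self.net(torch.cat([x_t, mixture, vis], dim=1))

@torch.no_grad()
def sample_separated_magnitude(model, mixture, vis_feat, steps=50):
    """DDPM-style reverse process: start from Gaussian noise and iteratively
    denoise it into a separated magnitude spectrogram."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(mixture)               # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = model(x, mixture, vis_feat, t)    # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                               # no noise added at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Usage: a 256-bin, 64-frame mixture magnitude plus a 512-d visual feature.
model = SeparationUNet()
mixture = torch.randn(1, 1, 256, 64).abs()
vis_feat = torch.randn(1, 512)
separated = sample_separated_magnitude(model, mixture, vis_feat)
```

A real diffusion separator would additionally condition on the timestep t (ignored by the toy network above) and use a far deeper U-Net; this sketch only shows the shape of the conditional sampling loop.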
Related papers
- Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio.
We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks.
In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z)
- An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation [0.4893345190925179]
Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals.
This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance (a minimal sketch of this blending idea appears after this list).
arXiv Detail & Related papers (2024-10-28T06:18:12Z)
- LAVSS: Location-Guided Audio-Visual Spatial Audio Separation [52.44052357829296]
We propose a location-guided audio-visual spatial audio separator.
The proposed LAVSS is inspired by the correlation between spatial audio and visual location.
In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation.
arXiv Detail & Related papers (2023-10-31T13:30:24Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
This paper proposes Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach in which the network learns what individual objects look and sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [96.18178553315472]
We propose to leverage the widely available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z)
- Leveraging Category Information for Single-Frame Visual Sound Source Separation [15.26733033527393]
We study simple yet efficient models for visual sound separation using only a single video frame.
Our models exploit sound source category information in the separation process.
arXiv Detail & Related papers (2020-07-15T20:35:29Z)
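As a companion to the ensemble entry above: combining separators typically reduces to blending per-stem estimates. Below is a minimal, hypothetical sketch assuming waveform-domain models that share a stem vocabulary; the toy separators and weights are illustrative only, not any paper's actual ensemble.

```python
# Hypothetical ensemble of music source separators (illustrative only): each
# "model" maps a mixture waveform to a dict of stem name -> estimated waveform,
# and the ensemble blends the per-stem estimates by weighted averaging.
import torch

def ensemble_separate(models, mixture, weights=None):
    """Blend stem estimates from several separators by weighted averaging."""
    weights = weights or [1.0 / len(models)] * len(models)
    blended = {}
    for model, w in zip(models, weights):
        for stem, est in model(mixture).items():
            blended[stem] = blended.get(stem, 0.0) + w * est
    return blended

# Usage with two toy "separators" that split the mixture naively.
toy_a = lambda mix: {"vocals": 0.6 * mix, "accompaniment": 0.4 * mix}
toy_b = lambda mix: {"vocals": 0.5 * mix, "accompaniment": 0.5 * mix}
mixture = torch.randn(1, 44100)   # one second of audio at 44.1 kHz
stems = ensemble_separate([toy_a, toy_b], mixture)
```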