Related papers: DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

URL: http://arxiv.org/abs/2308.00122v1
Date: Mon, 31 Jul 2023 19:41:49 GMT
Title: DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models
Authors: Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Abstract summary: DAVIS is a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset. Results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
Score: 49.62299756133055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
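
Illustrative sketch (not from the paper): the abstract describes synthesizing a separated magnitude from Gaussian noise, conditioned on the audio mixture and the visual footage. The Python sketch below shows what such a conditional reverse-diffusion loop can look like, assuming standard DDPM ancestral sampling with a linear beta schedule; the abstract does not specify the sampler, schedule, or network interface, so these are assumptions. The names `toy_target` and `toy_noise_predictor` are hypothetical stand-ins for the learned Separation U-Net: the predictor here is an oracle that already knows a made-up target, which only serves to make the loop runnable.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_target(mixture_mag, visual_feat):
    """Hypothetical 'clean' magnitude: the mixture scaled by a visual gate."""
    gate = 1.0 / (1.0 + np.exp(-visual_feat.mean()))
    return gate * mixture_mag


def toy_noise_predictor(x_t, t, mixture_mag, visual_feat, alpha_bars):
    """Stand-in for a learned denoiser: an oracle that knows the toy target.

    A real model would predict the noise from (x_t, t, mixture, visual)
    without access to the clean signal.
    """
    x0 = toy_target(mixture_mag, visual_feat)
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])


def sample_separated_magnitude(mixture_mag, visual_feat, num_steps=50, seed=0):
    """DDPM-style ancestral sampling, conditioned on mixture and visual input."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(mixture_mag.shape)  # start from Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t, mixture_mag, visual_feat, alpha_bars)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


# Usage: a random non-negative "spectrogram" and a dummy visual feature vector.
mixture = np.abs(np.random.default_rng(1).standard_normal((8, 16)))
visual = np.ones(4)
separated = sample_separated_magnitude(mixture, visual)
```

Because the stand-in predictor is an oracle, the loop lands on the toy target by construction; with a trained network the output would instead depend on what the model has learned from data.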

Related papers

Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z)
An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation [0.4893345190925179]
Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals. This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance.
arXiv Detail & Related papers (2024-10-28T06:18:12Z)
LAVSS: Location-Guided Audio-Visual Spatial Audio Separation [52.44052357829296]
We propose a location-guided audio-visual spatial audio separator. The proposed LAVSS is inspired by the correlation between spatial audio and visual location. In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation.
arXiv Detail & Related papers (2023-10-31T13:30:24Z)
Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner. In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously. MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs. Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach, where the network learns both what individual objects look and sound like. We introduce weakly-supervised object segmentation in the context of sound separation. Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [96.18178553315472]
We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio. We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z)
Leveraging Category Information for Single-Frame Visual Sound Source Separation [15.26733033527393]
We study simple yet efficient models for visual sound separation using only a single video frame. Our models are able to exploit the information of the sound source category in the separation process.
arXiv Detail & Related papers (2020-07-15T20:35:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.