SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling
- URL: http://arxiv.org/abs/2602.01394v1
- Date: Sun, 01 Feb 2026 18:57:53 GMT
- Title: SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling
- Authors: Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya
- Abstract summary: This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in WER across all conditions.
- Score: 23.130313134690443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/
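The abstract's core idea, recovering several sources from one mixture by combining per-source generative priors with a data-consistency constraint, can be illustrated with a deliberately simplified sketch. Here Gaussian score functions stand in for the pretrained diffusion priors, and a plain gradient sampler replaces the paper's reformulated inverse sampler; the function names, the guidance scheme, and all parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_score(x, mean, var):
    """Score grad_x log p(x) of a Gaussian N(mean, var).
    A stand-in for a learned diffusion score model."""
    return (mean - x) / var

def separate(y, mean_a, mean_b, steps=500, step_size=0.05, guidance=1.0):
    """Jointly sample two sources whose sum matches the mixture y.

    Each iteration takes one prior step per source (its score model)
    and one guidance step pulling the reconstruction a + b toward y.
    """
    a = rng.normal(size=y.shape)
    b = rng.normal(size=y.shape)
    for _ in range(steps):
        # Prior steps: one dedicated "diffusion prior" per source.
        a = a + step_size * gaussian_score(a, mean_a, 1.0)
        b = b + step_size * gaussian_score(b, mean_b, 1.0)
        # Data consistency: shrink the mixture residual y - (a + b).
        resid = y - (a + b)
        a = a + guidance * step_size * resid
        b = b + guidance * step_size * resid
    return a, b

# Synthetic "speech" and "noise" whose priors we know exactly.
speech = np.full(8, 2.0)
noise = np.full(8, -1.0)
mixture = speech + noise

est_speech, est_noise = separate(mixture, mean_a=2.0, mean_b=-1.0)
```

With matching priors the iteration converges to estimates that both satisfy the mixture constraint and sit at each prior's mode; in the real system the Gaussian scores are replaced by an audio-visual speech score model and a noise score model evaluated along a diffusion schedule.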
Related papers
- ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation [55.76423101183408]
ViSAudio is an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture. It generates high-quality audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.
arXiv Detail & Related papers (2025-12-02T18:56:12Z) - High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling [65.02357548201188]
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information.
arXiv Detail & Related papers (2025-09-26T08:46:00Z) - Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior [24.815262863931334]
We propose a generative unsupervised technique that models both clean speech and structured noise components. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. Experimental results demonstrate promising performance, highlighting the effectiveness of our direct noise modelling approach.
arXiv Detail & Related papers (2025-09-17T19:25:35Z) - SEED: Speaker Embedding Enhancement Diffusion Model [27.198463567915386]
A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. Our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios.
arXiv Detail & Related papers (2025-05-22T15:38:37Z) - Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z) - SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding [51.311553815466446]
We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. Given a pre-acquired recording of the scene from sparsely distributed microphones, SoundVista can synthesize the sound of that scene from an unseen target viewpoint.
arXiv Detail & Related papers (2025-04-08T00:22:16Z) - Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes [16.530816405275715]
We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing.
arXiv Detail & Related papers (2025-03-24T16:56:04Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.