Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
- URL: http://arxiv.org/abs/2310.11713v1
- Date: Wed, 18 Oct 2023 05:03:57 GMT
- Title: Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
- Authors: Yiyang Su, Ali Vosoughi, Shijian Deng, Yapeng Tian, Chenliang Xu
- Abstract summary: This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework.
It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation.
AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing effectiveness.
- Score: 51.06562260845748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The audio-visual sound separation field assumes visible sources in videos,
but this excludes invisible sounds beyond the camera's view. Current methods
struggle with such sounds lacking visible cues. This paper introduces a novel
"Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework. It includes a
semantic parser for visible and invisible sounds and a separator for
scene-informed separation. AVSA-Sep successfully separates both sound types,
with joint training and cross-modal alignment enhancing effectiveness.
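The abstract only names the two components (a semantic parser and a scene-informed separator), so the following is a rough, hypothetical sketch of the kind of mask-based, scene-conditioned separation it describes; the module names, tensor shapes, and fusion scheme are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SceneConditionedSeparator(nn.Module):
    """Toy mask predictor: separate one source from a mixture spectrogram,
    conditioned on a semantic embedding of the target sound (which may be
    visible or invisible) and a global embedding of the visual scene."""
    def __init__(self, n_freq=256, emb_dim=128):
        super().__init__()
        self.cond = nn.Linear(2 * emb_dim, n_freq)          # fuse sound + scene embeddings
        self.mask_net = nn.Sequential(
            nn.Linear(2 * n_freq, 512), nn.ReLU(),
            nn.Linear(512, n_freq), nn.Sigmoid(),            # per-bin mask in [0, 1]
        )

    def forward(self, mix_spec, sound_emb, scene_emb):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram of the mixture
        cond = self.cond(torch.cat([sound_emb, scene_emb], dim=-1))  # (batch, n_freq)
        cond = cond.unsqueeze(1).expand_as(mix_spec)
        mask = self.mask_net(torch.cat([mix_spec, cond], dim=-1))
        return mask * mix_spec                               # estimated source magnitude

# toy usage with random tensors
sep = SceneConditionedSeparator()
mix = torch.rand(2, 100, 256)
sound = torch.randn(2, 128)   # e.g. embedding of an off-screen "dog bark" category
scene = torch.randn(2, 128)   # embedding of the whole visual scene
print(sep(mix, sound, scene).shape)  # torch.Size([2, 100, 256])
```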
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Egocentric Audio-Visual Noise Suppression [11.113020254726292]
This paper studies audio-visual noise suppression for egocentric videos.
The video camera emulates the off-screen speaker's view of the outside world.
We first demonstrate that egocentric visual information is helpful for noise suppression.
arXiv Detail & Related papers (2022-11-07T15:53:12Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task: separating audio sources from artificially mixed sounds using the visual graph (a toy sketch of this mix-and-separate recipe appears after this list).
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Self-Supervised Learning from Automatically Separated Sound Scenes [38.71803524843168]
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into semantically-linked views.
We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches (a toy contrastive formulation of this idea is sketched after this list).
arXiv Detail & Related papers (2021-05-05T15:37:17Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose audio-visual co-segmentation, in which the network learns both how individual objects look and how they sound.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds [33.4237979175049]
We present AudioScope, a novel audio-visual sound separation framework.
It can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos.
We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data.
arXiv Detail & Related papers (2020-11-02T17:36:13Z)
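For the Visual Scene Graphs for Audio Source Separation entry above: the self-supervised mix-and-separate task it mentions is a common pattern in visually guided separation. Below is a minimal sketch of that training recipe, assuming a hypothetical mask predictor `model(mix, emb)` conditioned on a visual (sub)graph embedding; none of the names come from AVSGS itself.

```python
import torch
import torch.nn.functional as F

def mix_and_separate_loss(model, spec_a, spec_b, graph_emb_a, graph_emb_b):
    """Mix two clips artificially, then ask the model to recover each
    original spectrogram conditioned on that source's visual embedding."""
    mix = spec_a + spec_b                    # artificial mixture = free supervision
    est_a = model(mix, graph_emb_a)          # reconstruct source a
    est_b = model(mix, graph_emb_b)          # reconstruct source b
    return F.l1_loss(est_a, spec_a) + F.l1_loss(est_b, spec_b)

# toy usage: a trivial "model" that ignores its conditioning
toy = lambda mix, emb: 0.5 * mix
loss = mix_and_separate_loss(toy, torch.rand(2, 100, 256), torch.rand(2, 100, 256),
                             torch.randn(2, 128), torch.randn(2, 128))
print(loss.item())
```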
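For the Self-Supervised Learning from Automatically Separated Sound Scenes entry above: one generic way to "associate input mixtures with their automatically separated outputs" is a contrastive objective over embeddings. The InfoNCE-style form below illustrates that idea only; it is not the paper's actual loss, and the encoder producing the embeddings is left out.

```python
import torch
import torch.nn.functional as F

def mixture_channel_contrastive_loss(mix_emb, chan_emb, temperature=0.1):
    """Pull each mixture embedding toward the embedding of its own separated
    channel and away from other mixtures' channels in the batch.
    mix_emb:  (batch, dim) embeddings of the input mixtures
    chan_emb: (batch, dim) embeddings of one separated channel per mixture"""
    mix_emb = F.normalize(mix_emb, dim=-1)
    chan_emb = F.normalize(chan_emb, dim=-1)
    logits = mix_emb @ chan_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(mix_emb.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

print(mixture_channel_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
```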
This list is automatically generated from the titles and abstracts of the papers on this site.