BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge
- URL: http://arxiv.org/abs/2308.10175v1
- Date: Sun, 20 Aug 2023 06:48:08 GMT
- Title: BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge
- Authors: Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, and Xin Yu
- Abstract summary: We propose a two-stage bootstrapping audio-visual segmentation framework.
In the first stage, we employ a segmentation model to localize potential sounding objects from visual data.
In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects.
- Score: 43.92428145744478
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate
sounding sources by predicting pixel-wise maps. Previous methods assume that
each sound component in an audio signal always has a visual counterpart in the
image. However, this assumption overlooks that off-screen sounds and background
noise often contaminate the audio recordings in real-world scenarios. They
impose significant challenges on building a consistent semantic mapping between
audio and visual signals for AVS models and thus impede precise sound
localization. In this work, we propose a two-stage bootstrapping audio-visual
segmentation framework by incorporating multi-modal foundation knowledge. In a
nutshell, our BAVS is designed to eliminate the interference of background
noise or off-screen sounds in segmentation by explicitly establishing
audio-visual correspondences. In the first stage, we employ a
segmentation model to localize potential sounding objects from visual data
without being affected by contaminated audio signals. Meanwhile, we also
utilize a foundation audio classification model to discern audio semantics.
Since the audio tags provided by the audio foundation model are noisy,
associating object masks with these tags is not trivial. Thus, in the second
stage, we develop an audio-visual semantic integration strategy (AVIS) to
localize the authentic-sounding objects. Here, we construct an audio-visual
tree based on the hierarchical correspondence between sounds and object
categories. We then examine the label concurrency between the localized objects
and classified audio tags by tracing the audio-visual tree. With AVIS, we can
effectively segment real-sounding objects. Extensive experiments demonstrate
the superiority of our method on AVS datasets, particularly in scenarios
involving background noise. Our project website is
https://yenanliu.github.io/AVSS.github.io/.
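
The second stage is the most algorithmic part of the abstract, so a minimal Python sketch of an AVIS-style label-concurrency check may help make it concrete. Everything below is an assumption for illustration only, not the authors' implementation or taxonomy: the toy tree, the category names, the helper names (build_toy_tree, concurrent, keep_sounding_objects), and the rule that an audio tag and an object label are "concurrent" when one lies on the other's path to the root.

```python
# Illustrative sketch (not the BAVS authors' code) of a tree-based
# label-concurrency check as described in the abstract's second stage.

from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AVNode:
    """A node in a toy audio-visual tree linking sound tags to object categories."""
    name: str
    parent: Optional["AVNode"] = None
    children: Dict[str, "AVNode"] = field(default_factory=dict)

    def add_child(self, name: str) -> "AVNode":
        child = AVNode(name, parent=self)
        self.children[name] = child
        return child

    def ancestors(self) -> List[str]:
        """Names on the path from this node up to the root, including itself."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


def build_toy_tree() -> Dict[str, AVNode]:
    """Build a tiny hypothetical hierarchy: sound super-classes over object classes."""
    root = AVNode("sound")
    music = root.add_child("music")
    animal = root.add_child("animal")
    index = {"sound": root, "music": music, "animal": animal}
    for leaf, parent in [("guitar", music), ("piano", music), ("dog", animal)]:
        index[leaf] = parent.add_child(leaf)
    return index


def concurrent(audio_tag: str, object_label: str, index: Dict[str, AVNode]) -> bool:
    """Treat a tag and a label as concurrent if one lies on the other's root path
    (e.g. the coarse tag 'music' matches the object label 'guitar')."""
    if audio_tag not in index or object_label not in index:
        return False
    return (object_label in index[audio_tag].ancestors()
            or audio_tag in index[object_label].ancestors())


def keep_sounding_objects(object_labels: List[str],
                          audio_tags: List[str],
                          index: Dict[str, AVNode]) -> List[str]:
    """Stage-2 filtering: keep only segmented objects whose label co-occurs with
    at least one classified audio tag; tags that match nothing are ignored."""
    return [obj for obj in object_labels
            if any(concurrent(tag, obj, index) for tag in audio_tags)]


if __name__ == "__main__":
    tree = build_toy_tree()
    # Visual stage found a guitar and a dog; the audio tagger heard music plus a siren.
    print(keep_sounding_objects(["guitar", "dog"], ["music", "siren"], tree))
    # -> ['guitar']
```

In this toy run, the guitar mask survives because its label sits under the classified "music" tag in the tree, while the silent dog and the off-screen "siren" tag are discarded, mirroring the filtering behaviour the abstract attributes to AVIS.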
Related papers
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference? [10.368382203643739]
We argue that audio lacks robust semantics compared to vision, resulting in weak audio guidance over the visual space.
Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance.
Our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets.
arXiv Detail & Related papers (2024-07-15T17:45:20Z)
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics [26.473529162341837]
We present an audio-visual instance-aware segmentation approach to overcome the dataset bias.
Our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio.
Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
arXiv Detail & Related papers (2023-07-31T12:56:30Z)
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio by grouping free-form descriptions of the audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Audio-Visual Segmentation with Semantics [45.5917563087477]
We propose a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark, AVSBench, providing pixel-wise annotations for sounding objects in audible videos.
arXiv Detail & Related papers (2023-01-30T18:53:32Z)
- Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)