Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
- URL: http://arxiv.org/abs/2307.16620v2
- Date: Tue, 1 Aug 2023 01:40:17 GMT
- Title: Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
- Authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang,
Xin Yu
- Abstract summary: We present an audio-visual instance-aware segmentation approach to overcome the dataset bias.
Our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio.
Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
- Score: 26.473529162341837
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The audio-visual segmentation (AVS) task aims to segment sounding objects
from a given video. Existing works mainly focus on fusing audio and visual
features of a given video to achieve sounding object masks. However, we
observed that prior arts are prone to segment a certain salient object in a
video regardless of the audio information. This is because sounding objects are
often the most salient ones in the AVS dataset. Thus, current AVS methods might
fail to localize genuine sounding objects due to the dataset bias. In this
work, we present an audio-visual instance-aware segmentation approach to
overcome the dataset bias. In a nutshell, our method first localizes potential
sounding objects in a video by an object segmentation network, and then
associates the sounding object candidates with the given audio. We notice that
an object could be a sounding object in one video but a silent one in another
video. This would bring ambiguity in training our object segmentation network
as only sounding objects have corresponding segmentation masks. We thus propose
a silent object-aware segmentation objective to alleviate the ambiguity.
Moreover, since the category information of audio is unknown, especially for
multiple sounding sources, we propose to explore the audio-visual semantic
correlation and then associate audio with potential objects. Specifically, we
attend predicted audio category scores to potential instance masks, and these
scores highlight the corresponding sounding instances while suppressing
inaudible ones. By enforcing the attended instance masks to resemble the
ground-truth mask, we are able to establish the audio-visual semantic correlation.
Experimental results on the AVS benchmarks demonstrate that our method can
effectively segment sounding objects without being biased to salient objects.
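A minimal sketch of the attended-mask objective described in the abstract, as we read it: audio category scores are attended to candidate instance masks, and the aggregated mask is supervised against the ground truth. The tensor names and shapes, the softmax weighting, and the binary cross-entropy loss are illustrative assumptions, not the authors' released implementation.
```python
# Minimal sketch of the attended-mask objective, as we read it from the abstract.
# Tensor names/shapes, the softmax weighting, and the BCE loss are assumptions
# for illustration only, not the authors' released implementation.
import torch
import torch.nn.functional as F


def attended_mask_loss(instance_masks, instance_logits, audio_logits, gt_mask):
    """instance_masks:  (N, H, W) soft masks for N candidate objects
    instance_logits: (N, C) per-instance category logits from the visual branch
    audio_logits:    (C,)   audio category logits (audio categories are unlabeled,
                            so these come from an audio classifier)
    gt_mask:         (H, W) binary ground-truth mask of the sounding objects
    """
    # Per-instance "soundingness": agreement between each instance's predicted
    # category distribution and the audio's predicted category distribution.
    instance_probs = instance_logits.softmax(dim=-1)        # (N, C)
    audio_probs = audio_logits.softmax(dim=-1)              # (C,)
    sounding_scores = instance_probs @ audio_probs          # (N,)

    # Attend the audio-derived scores to the candidate masks: sounding instances
    # are highlighted, inaudible ones are suppressed.
    attended = (sounding_scores[:, None, None] * instance_masks).sum(dim=0)
    attended = attended.clamp(0.0, 1.0)                     # keep a valid soft mask

    # Enforcing the attended mask to resemble the ground truth supervises the
    # audio-visual semantic correlation without audio category labels.
    return F.binary_cross_entropy(attended, gt_mask)


# Toy usage with random tensors, only to show the expected shapes.
N, C, H, W = 4, 10, 64, 64
loss = attended_mask_loss(
    torch.rand(N, H, W),
    torch.randn(N, C, requires_grad=True),
    torch.randn(C),
    torch.randint(0, 2, (H, W)).float(),
)
loss.backward()
print(float(loss))
```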
Related papers
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference? [10.368382203643739]
We argue that audio lacks robust semantics compared to vision, resulting in weak audio guidance over the visual space.
Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance.
Our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets.
arXiv Detail & Related papers (2024-07-15T17:45:20Z) - Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z) - BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation
Knowledge [43.92428145744478]
We propose a two-stage bootstrapping audio-visual segmentation framework.
In the first stage, we employ a segmentation model to localize potential sounding objects from visual data.
In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects.
arXiv Detail & Related papers (2023-08-20T06:48:08Z) - Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z) - Audio-Visual Segmentation with Semantics [45.5917563087477]
We propose a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark, AVSBench, providing pixel-wise annotations for sounding objects in audible videos.
arXiv Detail & Related papers (2023-01-30T18:53:32Z) - Visual Sound Localization in the Wild by Cross-Modal Interference
Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z) - Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z) - Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation, where the network learns both what individual objects look like and what they sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z) - Discriminative Sounding Objects Localization via Self-supervised
Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.