Audio-Visual Segmentation with Semantics
- URL: http://arxiv.org/abs/2301.13190v1
- Date: Mon, 30 Jan 2023 18:53:32 GMT
- Title: Audio-Visual Segmentation with Semantics
- Authors: Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun,
Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong
- Abstract summary: We propose a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark, AVSBench, providing pixel-wise annotations for sounding objects in audible videos.
- Score: 45.5917563087477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the
time of the image frame. To facilitate this research, we construct the first
audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise
annotations for sounding objects in audible videos. It contains three subsets:
AVSBench-object (Single-source subset, Multi-sources subset) and
AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are
studied: 1) semi-supervised audio-visual segmentation with a single sound
source; 2) fully-supervised audio-visual segmentation with multiple sound
sources; and 3) fully-supervised audio-visual semantic segmentation. The first
two settings need to generate binary masks of sounding objects indicating
pixels corresponding to the audio, while the third setting further requires
generating semantic maps indicating the object category. To deal with these
problems, we propose a new baseline method that uses a temporal pixel-wise
audio-visual interaction module to inject audio semantics as guidance for the
visual segmentation process. We also design a regularization loss to encourage
audio-visual mapping during training. Quantitative and qualitative experiments
on AVSBench compare our approach to several existing methods for related tasks,
demonstrating that the proposed method is promising for building a bridge
between the audio and pixel-wise visual semantics. Code is available at
https://github.com/OpenNLPLab/AVSBench. Online benchmark is available at
http://www.avlbench.opennlplab.cn.
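The abstract above describes the baseline only at a high level. The following is a minimal PyTorch-style sketch of the two ideas it names: a temporal pixel-wise audio-visual attention step that injects audio semantics into the visual features, and a simple regularizer that pulls mask-pooled visual features toward the audio features. The module names, tensor shapes, and the cosine-similarity form of the loss are assumptions for illustration, not the authors' exact implementation (see the linked repository for that).

```python
# Hypothetical sketch of a temporal pixel-wise audio-visual interaction step
# and an audio-visual alignment regularizer, in the spirit of the abstract.
# Names, shapes, and the loss form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelWiseAudioVisualAttention(nn.Module):
    """Lets every visual pixel attend over the audio features of all frames."""

    def __init__(self, vis_dim: int, aud_dim: int, hid_dim: int = 256):
        super().__init__()
        self.q_proj = nn.Conv2d(vis_dim, hid_dim, kernel_size=1)  # pixel queries
        self.k_proj = nn.Linear(aud_dim, hid_dim)                 # audio keys
        self.v_proj = nn.Linear(aud_dim, vis_dim)                 # audio values
        self.scale = hid_dim ** -0.5

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, C, H, W) visual feature maps; aud: (B, T, D) audio features
        B, T, C, H, W = vis.shape
        q = self.q_proj(vis.flatten(0, 1))                          # (B*T, hid, H, W)
        q = q.flatten(2).transpose(1, 2).reshape(B, T * H * W, -1)  # (B, T*H*W, hid)
        k = self.k_proj(aud)                                        # (B, T, hid)
        v = self.v_proj(aud)                                        # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, T*H*W, T)
        fused = (attn @ v).reshape(B, T, H, W, C).permute(0, 1, 4, 2, 3)
        return vis + fused                                          # residual audio guidance


def av_map_loss(vis: torch.Tensor, aud: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Simplified regularizer: align mask-pooled visual features with audio features."""
    # vis: (B, T, C, H, W); aud: (B, T, C), assumed already projected to C channels;
    # mask: (B, T, 1, H, W) predicted masks in [0, 1]
    pooled = (vis * mask).sum(dim=(-2, -1)) / mask.sum(dim=(-2, -1)).clamp(min=1e-6)
    return (1 - F.cosine_similarity(pooled, aud, dim=-1)).mean()
```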
Related papers
- Can Textual Semantics Mitigate Sounding Object Segmentation Preference? [10.368382203643739]
We argue that audio lacks robust semantics compared to vision, resulting in weak audio guidance over the visual space.
Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance.
Our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets.
arXiv Detail & Related papers (2024-07-15T17:45:20Z)
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation [36.50512269898893]
To distinguish the sounding objects from silent ones, audio-visual semantic correspondence and temporal interaction are required.
We propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information (a generic sketch of this audio-queried design appears after this list).
Our method achieves state-of-the-art performance, with gains of 7.1% M_J and 7.6% M_F on the MS3 setting.
arXiv Detail & Related papers (2023-09-18T05:58:06Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained audio-mask pair annotations in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge [43.92428145744478]
We propose a two-stage bootstrapping audio-visual segmentation framework.
In the first stage, we employ a segmentation model to localize potential sounding objects from visual data.
In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects (see the pipeline sketch after this list).
arXiv Detail & Related papers (2023-08-20T06:48:08Z)
- TransAVS: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the Audio-Visual Segmentation (AVS) task.
TransAVS disentangles the audio stream into audio queries, which interact with image features and are decoded into segmentation masks.
Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z)
- Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
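Two of the entries above (AQFormer and TransAVS) describe query-based designs in which object queries are conditioned on, or derived from, the audio. The sketch below shows one generic way such audio-conditioned queries can drive a transformer decoder and produce per-query mask logits; all names, dimensions, and the conditioning-by-addition choice are assumptions, not either paper's implementation.

```python
# Hypothetical sketch of audio-conditioned object queries (cf. the AQFormer and
# TransAVS entries above); module names and dimensions are assumptions.
import torch
import torch.nn as nn


class AudioQueriedDecoder(nn.Module):
    def __init__(self, num_queries: int = 20, dim: int = 256, aud_dim: int = 128):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # learnable object queries
        self.aud_proj = nn.Linear(aud_dim, dim)         # project audio into query space
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.mask_embed = nn.Linear(dim, dim)           # query -> mask embedding

    def forward(self, vis_tokens: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, H*W, dim) flattened visual features; aud: (B, aud_dim)
        B = vis_tokens.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, N, dim)
        q = q + self.aud_proj(aud).unsqueeze(1)                 # condition queries on audio
        q = self.decoder(tgt=q, memory=vis_tokens)              # (B, N, dim)
        # Per-query mask logits via dot product with pixel features
        masks = torch.einsum("bnd,bpd->bnp", self.mask_embed(q), vis_tokens)
        return masks                                            # (B, N, H*W)
```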
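The BAVS entry describes a two-stage bootstrapping pipeline. The following is one plausible, heavily simplified reading of that idea: propose object masks with a visual segmentation model, tag the audio with an audio classifier, and keep only the masks whose category is consistent with the audio tags. The function signatures and the label-matching rule are assumptions; they are not the authors' AVIS strategy.

```python
# Hypothetical two-stage bootstrapping pipeline in the spirit of the BAVS entry
# above; model interfaces and the matching rule are assumptions.
from typing import Callable, List, Tuple
import numpy as np


def bootstrap_sounding_masks(
    frame: np.ndarray,
    audio: np.ndarray,
    segment_fn: Callable[[np.ndarray], List[Tuple[np.ndarray, str]]],  # -> [(mask, label)]
    audio_tag_fn: Callable[[np.ndarray], List[str]],                   # -> audio class tags
) -> List[np.ndarray]:
    """Stage 1: propose object masks from the frame.
    Stage 2: keep only masks whose visual label is consistent with the audio tags."""
    proposals = segment_fn(frame)       # candidate (mask, category) pairs
    tags = set(audio_tag_fn(audio))     # categories heard in the audio clip
    return [mask for mask, label in proposals if label in tags]
```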
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.