Discovering Sounding Objects by Audio Queries for Audio Visual
Segmentation
- URL: http://arxiv.org/abs/2309.09501v1
- Date: Mon, 18 Sep 2023 05:58:06 GMT
- Title: Discovering Sounding Objects by Audio Queries for Audio Visual
Segmentation
- Authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han,
Wenge Rong, Si Liu
Our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting.
- Score: 36.50512269898893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio visual segmentation (AVS) aims to segment the sounding objects for each
frame of a given video. To distinguish the sounding objects from silent ones,
both audio-visual semantic correspondence and temporal interaction are
required. The previous method applies multi-frame cross-modal attention to
conduct pixel-level interactions between audio features and visual features of
multiple frames simultaneously, which is both redundant and implicit. In this
paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we
define a set of object queries conditioned on audio information and associate
each of them to particular sounding objects. Explicit object-level semantic
correspondence between audio and visual modalities is established by gathering
object information from visual features with predefined audio queries. Besides,
an Audio-Bridged Temporal Interaction module is proposed to exchange sounding
object-relevant information among multiple frames with the bridge of audio
features. Extensive experiments are conducted on two AVS benchmarks to show
that our method achieves state-of-the-art performance, with gains of 7.1% M_J and
7.6% M_F on the MS3 setting.
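As a rough illustration of the mechanism the abstract describes, the sketch below conditions a fixed set of learned object queries on an audio embedding and uses them to gather object-level information from per-frame visual features. This is a minimal PyTorch sketch under assumed shapes and a simple additive fusion; the module name, dimensions, and fusion operator are illustrative assumptions, not the authors' implementation of AQFormer.

```python
import torch
import torch.nn as nn

class AudioQueriedDecoder(nn.Module):
    """Minimal sketch of audio-conditioned object queries (not the authors' code).

    A fixed set of learned query embeddings is conditioned on an audio feature,
    then cross-attends to per-frame visual features to gather object-level
    information. Layer counts, dimensions, and the fusion operator are illustrative.
    """

    def __init__(self, dim=256, num_queries=8, num_heads=8):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, dim)   # learned object queries
        self.audio_proj = nn.Linear(dim, dim)                # project the audio feature
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.mask_head = nn.Linear(dim, dim)                 # produces per-query mask embeddings

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (B, dim)      one audio embedding per frame/clip
        # visual_feat: (B, HW, dim)  flattened per-frame visual features
        B = audio_feat.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        # Condition every object query on the audio feature (additive fusion here;
        # the exact fusion used by AQFormer may differ).
        audio_queries = queries + self.audio_proj(audio_feat).unsqueeze(1)
        # Gather object information from visual features with the audio queries.
        attended, _ = self.cross_attn(audio_queries, visual_feat, visual_feat)
        obj_embed = attended + self.ffn(attended)
        mask_embed = self.mask_head(obj_embed)               # (B, num_queries, dim)
        # Per-query masks via dot product with per-pixel visual features.
        masks = torch.einsum("bqc,bpc->bqp", mask_embed, visual_feat)
        return masks  # (B, num_queries, HW); reshape to (H, W) downstream
```

In AQFormer, each audio query is further associated with a particular sounding object, and an Audio-Bridged Temporal Interaction module exchanges sounding-object-relevant information across frames; neither detail is reproduced in this sketch.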
Related papers
- Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation [17.123212921673176]
We propose a Spatio-Temporal, Bi-Visual Attention (ST-B) module integrated between SAM's encoder and mask decoder.
It adaptively updates the audio-visual features to convey the temporal correspondence between the video frames and audio streams.
Our proposed model outperforms the state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-sources subset.
arXiv Detail & Related papers (2024-06-10T10:53:23Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose decoupled audio-video dependence modeling that combines audio and video features along their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time in a unified framework, MUTR adopts a DETR-style transformer and can segment video objects designated by either a text or an audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- TransAVS: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the audio-visual segmentation (AVS) task.
TransAVS disentangles the audio stream as audio queries, which will interact with images and decode into segmentation masks.
Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z)
- Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
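The abstract above reports M_J and M_F gains on the MS3 (multi-source) setting of the AVSBench benchmark introduced in the Audio-Visual Segmentation entry. These metrics are the mean Jaccard index (mask IoU) and the mean F-measure over frames. A minimal NumPy sketch is given below; the beta^2 = 0.3 weighting follows the common salient-object-detection convention and is an assumption here, not a detail taken from the paper.

```python
import numpy as np

def jaccard(pred, gt, eps=1e-7):
    """Jaccard index (IoU) between two binary masks; M_J averages this over frames."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + eps)

def f_measure(pred, gt, beta2=0.3, eps=1e-7):
    """F-measure between binary masks; M_F averages this over frames.

    beta2 = 0.3 is an assumed convention (common in salient object detection),
    not a value confirmed by the AQFormer abstract.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```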
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.