Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
- URL: http://arxiv.org/abs/2506.23623v1
- Date: Mon, 30 Jun 2025 08:40:36 GMT
- Title: Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
- Authors: Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang
- Abstract summary: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. Our framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset.
- Score: 60.83798235788669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at https://github.com/spyflying/VCT_AVS.
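The abstract describes the vision-centric decoding only at a high level. As one way to picture it, the following is a minimal PyTorch sketch, not the authors' released implementation (see the repository link above); the module names, feature shapes, layer count, and the final mask projection are illustrative assumptions. It shows vision-derived object queries attending first to audio tokens (to disambiguate sounding objects in mixed audio) and then to dense pixel features (to recover contour detail), as the abstract outlines.

```python
# Hypothetical sketch of a vision-centric decoder layer; names and shapes are
# assumptions, not the code from https://github.com/spyflying/VCT_AVS.
import torch
import torch.nn as nn


class VisionCentricDecoderLayer(nn.Module):
    """One decoding step: vision-derived queries fetch audio, then visual cues."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, queries, audio_feats, visual_feats):
        # queries:      (B, Nq, C) vision-derived object queries
        # audio_feats:  (B, Na, C) tokens from the (possibly mixed) audio signal
        # visual_feats: (B, HW, C) flattened pixel features for dense prediction
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.audio_attn(q, audio_feats, audio_feats)[0])     # separate sounding objects
        q = self.norms[2](q + self.visual_attn(q, visual_feats, visual_feats)[0])  # recover contour detail
        return self.norms[3](q + self.ffn(q))


# Toy usage: 100 queries refined over 3 layers, then projected into mask logits.
B, Nq, C, H, W = 2, 100, 256, 32, 32
queries = torch.randn(B, Nq, C)      # e.g., output of a PPQG-like query generator
audio = torch.randn(B, 10, C)
pixels = torch.randn(B, H * W, C)
for layer in (VisionCentricDecoderLayer(C) for _ in range(3)):
    queries = layer(queries, audio, pixels)
masks = torch.einsum("bqc,bpc->bqp", queries, pixels).view(B, Nq, H, W)
```

In an audio-centric design the queries would instead be derived from audio features; the point of the sketch is only that vision-derived queries keep direct access to dense visual detail while still consulting the audio stream.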
Related papers
- AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task. Our framework incorporates two key components for video understanding and cross-modal learning. Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z) - Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z) - CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video dependence combining audio and video features from their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z) - Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation [22.28510611697998]
We propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task.
Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features.
arXiv Detail & Related papers (2023-07-25T03:59:04Z) - AVSegFormer: Audio-Visual Segmentation with Transformer [42.24135756439358]
A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video.
This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges.
We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
arXiv Detail & Related papers (2023-07-03T16:37:10Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Annotation-free Audio-Visual Segmentation [46.42570058385209]
We propose a novel pipeline for generating artificial data for the Audio-Visual Segmentation (AVS) task without extra manual annotations.
We leverage existing image segmentation and audio datasets and match the image-mask pairs with their corresponding audio samples using category labels.
We also introduce a lightweight model, SAMA-AVS, which adapts the pre-trained Segment Anything Model (SAM) to the AVS task.
arXiv Detail & Related papers (2023-05-18T14:52:45Z) - Transavs: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the Audio-Visual Segmentation (AVS) task.
TransAVS disentangles the audio stream into audio queries, which interact with images and are decoded into segmentation masks.
Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z) - TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z) - Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame; a minimal sketch of this per-frame mask output appears after the list below.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
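As described in the Audio-Visual Segmentation entry above, the task output is a per-frame binary mask of the sound-producing pixels. The sketch below is a generic illustration of that output format scored with a plain intersection-over-union; it is not tied to any of the listed papers' code, and the metrics actually reported on AVSBench (e.g., Jaccard index and F-score) may be computed differently.

```python
# Hypothetical example of scoring per-frame AVS masks with IoU; not an
# official AVSBench evaluation script.
import numpy as np


def frame_iou(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-6) -> float:
    """IoU between predicted and ground-truth binary masks of shape (H, W)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float((inter + eps) / (union + eps))


# Toy usage: a 5-frame clip with one (H, W) mask per frame.
rng = np.random.default_rng(0)
preds = rng.random((5, 224, 224)) > 0.5   # predicted sounding-object masks
gts = rng.random((5, 224, 224)) > 0.5     # ground-truth masks
print(np.mean([frame_iou(p, g) for p, g in zip(preds, gts)]))
```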