AVSegFormer: Audio-Visual Segmentation with Transformer
- URL: http://arxiv.org/abs/2307.01146v4
- Date: Mon, 18 Dec 2023 11:20:25 GMT
- Title: AVSegFormer: Audio-Visual Segmentation with Transformer
- Authors: Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
- Abstract summary: A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video.
This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges.
We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The combination of audio and vision has long been a topic of interest in the
multi-modal community. Recently, a new audio-visual segmentation (AVS) task has
been introduced, aiming to locate and segment the sounding objects in a given
video. This task demands audio-driven pixel-level scene understanding for the
first time, posing significant challenges. In this paper, we propose
AVSegFormer, a novel framework for AVS tasks that leverages the transformer
architecture. Specifically, we introduce audio queries and learnable queries
into the transformer decoder, enabling the network to selectively attend to
interested visual features. Besides, we present an audio-visual mixer, which
can dynamically adjust visual features by amplifying relevant and suppressing
irrelevant spatial channels. Additionally, we devise an intermediate mask loss
to enhance the supervision of the decoder, encouraging the network to produce
more accurate intermediate predictions. Extensive experiments demonstrate that
AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is
available at https://github.com/vvvb-github/AVSegFormer.
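To make the abstract's two main ideas concrete, the sketch below illustrates (1) a transformer decoder driven by an audio query plus learnable queries that attend to visual features, and (2) an audio-visual mixer that amplifies or suppresses visual channels conditioned on the audio embedding. This is a minimal illustration only, not the official implementation (see the linked repository); the module names, gating design, and all dimensions are assumptions.
```python
# Minimal sketch of the abstract's ideas (assumed design, not the official code).
import torch
import torch.nn as nn


class AudioVisualMixer(nn.Module):
    """Scales each visual channel by an audio-conditioned gate (assumed design)."""

    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(audio_dim, visual_dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W), audio: (B, audio_dim)
        w = self.gate(audio)[:, :, None, None]      # (B, C, 1, 1) channel weights
        return visual * w                           # amplify / suppress channels


class QueryDecoder(nn.Module):
    """Transformer decoder fed with an audio query plus learnable queries."""

    def __init__(self, dim: int = 256, num_learnable: int = 100, num_layers: int = 3):
        super().__init__()
        self.learnable = nn.Embedding(num_learnable, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, audio_query: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map, audio_query: (B, 1, C)
        b, c, h, w = visual.shape
        memory = visual.flatten(2).transpose(1, 2)   # (B, H*W, C) visual tokens
        queries = torch.cat(
            [audio_query, self.learnable.weight.unsqueeze(0).expand(b, -1, -1)], dim=1
        )                                            # (B, 1 + N, C)
        out = self.decoder(queries, memory)          # queries attend to visual tokens
        # Dot product between query embeddings and per-pixel features -> mask logits.
        return torch.einsum("bqc,bchw->bqhw", self.mask_embed(out), visual)


if __name__ == "__main__":
    B, C, H, W, A = 2, 256, 28, 28, 128              # assumed toy dimensions
    visual, audio = torch.randn(B, C, H, W), torch.randn(B, A)
    audio_query = nn.Linear(A, C)(audio).unsqueeze(1)  # project audio to a query token
    mixed = AudioVisualMixer(audio_dim=A, visual_dim=C)(visual, audio)
    masks = QueryDecoder(dim=C)(mixed, audio_query)    # (B, 101, 28, 28) mask logits
    print(masks.shape)
```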
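The abstract also mentions an intermediate mask loss that supervises the decoder's intermediate predictions. A common way to realize this kind of deep supervision is sketched below; the helper name, the per-layer weighting scheme, and the choice of binary cross-entropy are assumptions, not the paper's exact loss.
```python
# Hedged sketch of deep supervision over per-layer mask predictions (assumed form).
import torch
import torch.nn.functional as F


def intermediate_mask_loss(per_layer_logits, gt_mask, aux_weight=0.4):
    """Final decoder layer gets full weight; intermediate layers get a smaller
    weight (aux_weight is an assumption)."""
    *intermediate, final = per_layer_logits                     # list of (B, H, W) logits
    loss = F.binary_cross_entropy_with_logits(final, gt_mask)
    for logits in intermediate:
        loss = loss + aux_weight * F.binary_cross_entropy_with_logits(logits, gt_mask)
    return loss


# Example: logits from 3 decoder layers against a (B, H, W) ground-truth mask.
layers = [torch.randn(2, 28, 28) for _ in range(3)]
gt = torch.randint(0, 2, (2, 28, 28)).float()
print(intermediate_mask_loss(layers, gt))
```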
Related papers
- Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation [7.124066540020968]
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS) pursues semantic understanding of audio-visual scenes.
Previous methods have struggled to handle this mix of objectives in end-to-end training, resulting in insufficient learning and sub-optimal performance.
We propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simple subtasks, from localization to semantic understanding, each fully optimized in its own stage to achieve step-by-step global optimization.
arXiv Detail & Related papers (2024-07-16T15:08:30Z) - Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z) - Weakly-Supervised Audio-Visual Segmentation [44.632423828359315]
We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-instance contrastive learning for audio-visual segmentation.
Experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
arXiv Detail & Related papers (2023-11-25T17:18:35Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations or model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation [22.28510611697998]
We propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task.
Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features.
arXiv Detail & Related papers (2023-07-25T03:59:04Z) - Transavs: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the Audio-Visual Segmentation task.
TransAVS disentangles the audio stream as audio queries, which will interact with images and decode into segmentation masks.
Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z) - Vision Transformers are Parameter-Efficient Audio-Visual Learners [95.59258503297195]
We propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks.
Our approach achieves competitive or even better performance on various audio-visual tasks.
arXiv Detail & Related papers (2022-12-15T17:31:54Z) - Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z) - AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)