CATR: Combinatorial-Dependence Audio-Queried Transformer for
Audio-Visual Video Segmentation
- URL: http://arxiv.org/abs/2309.09709v2
- Date: Wed, 20 Sep 2023 17:55:55 GMT
- Title: CATR: Combinatorial-Dependence Audio-Queried Transformer for
Audio-Visual Video Segmentation
- Authors: Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao
- Abstract summary: Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions.
- Score: 43.562848631392384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of
sound-producing objects within image frames and ensure the maps faithfully
adhere to the given audio, such as identifying and segmenting a singing person
in a video. However, existing methods exhibit two limitations: 1) they address
video temporal features and audio-visual interactive features separately,
disregarding the inherent spatial-temporal dependence of combined audio and
video, and 2) they inadequately introduce audio constraints and object-level
information during the decoding stage, resulting in segmentation outcomes that
fail to comply with audio directives. To tackle these issues, we propose a
decoupled audio-video transformer that combines audio and video features from
their respective temporal and spatial dimensions, capturing their combined
dependence. To optimize memory consumption, we design a block that, when stacked,
captures fine-grained audio-visual combinatorial dependence in a memory-efficient
manner. Additionally, we introduce audio-constrained
queries during the decoding phase. These queries contain rich object-level
information, ensuring the decoded mask adheres to the sounds. Experimental
results confirm our approach's effectiveness, with our framework achieving new
state-of-the-art (SOTA) performance on all three datasets using two backbones. The code is
available at https://github.com/aspirinone/CATR.github.io
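To make the abstract's decoupled spatial-temporal design more concrete, below is a minimal PyTorch sketch of a stackable block that attends over the temporal axis and the spatial axis separately, mixing audio tokens into both attention steps. All class, variable, and hyperparameter names are hypothetical illustrations inferred from the abstract, not the authors' released implementation (see the repository linked above for that):

# Hypothetical sketch of the decoupled spatial/temporal audio-video fusion idea
# described in the abstract. Names and shapes are assumptions, not the released code.
import torch
import torch.nn as nn


class DecoupledAVBlock(nn.Module):
    """One stackable block: audio-video attention applied separately along the
    temporal axis and the spatial axis, so a joint attention over all T*H*W
    tokens is never materialised (the memory-saving intuition from the abstract)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.audio_proj = nn.Linear(dim, dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, T, N, C) with N = H*W spatial tokens; audio: (B, T, C)
        B, T, N, C = video.shape
        a = self.audio_proj(audio)                                   # (B, T, C)

        # Temporal branch: each spatial location attends over time, with the
        # per-frame audio tokens appended to the key/value set.
        v_t = video.permute(0, 2, 1, 3).reshape(B * N, T, C)         # (B*N, T, C)
        a_t = a.unsqueeze(1).expand(B, N, T, C).reshape(B * N, T, C)
        kv_t = torch.cat([v_t, a_t], dim=1)                          # (B*N, 2T, C)
        v_t = v_t + self.temporal_attn(self.norm1(v_t), kv_t, kv_t)[0]
        video = v_t.reshape(B, N, T, C).permute(0, 2, 1, 3)          # back to (B, T, N, C)

        # Spatial branch: each frame attends over its spatial tokens plus that
        # frame's audio token.
        v_s = video.reshape(B * T, N, C)
        a_s = a.reshape(B * T, 1, C)
        kv_s = torch.cat([v_s, a_s], dim=1)                          # (B*T, N+1, C)
        v_s = v_s + self.spatial_attn(self.norm2(v_s), kv_s, kv_s)[0]
        return v_s.reshape(B, T, N, C)


# Stacking a few blocks deepens the audio-visual interaction while keeping each
# attention either T-sized or N-sized.
blocks = nn.ModuleList([DecoupledAVBlock(dim=256) for _ in range(4)])
video_feats = torch.randn(2, 5, 14 * 14, 256)   # (B, T, H*W, C), toy sizes
audio_feats = torch.randn(2, 5, 256)            # (B, T, C)
for blk in blocks:
    video_feats = blk(video_feats, audio_feats)
print(video_feats.shape)  # torch.Size([2, 5, 196, 256])

Each attention in this sketch scales with T² per spatial location or N² per frame rather than (T·N)², which is the rough intuition behind the abstract's memory-efficiency claim.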
Related papers
- Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis [28.172213291270868]
Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience.
Video-to-Audio (V2A) presents inherent challenges related to audio-visual synchronization.
We construct a controllable video-to-audio model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals.
arXiv Detail & Related papers (2024-09-10T01:07:20Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation [36.50512269898893]
To distinguish the sounding objects from silent ones, audio-visual semantic correspondence and temporal interaction are required.
We propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information; a generic sketch of this audio-queried decoding idea appears after this list.
Our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting.
arXiv Detail & Related papers (2023-09-18T05:58:06Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- TransAVS: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the Audio-Visual Segmentation (AVS) task.
TransAVS disentangles the audio stream as audio queries, which will interact with images and decode into segmentation masks.
Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z)
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Object Segmentation with Audio Context [0.5243460995467893]
This project explores the multimodal feature aggregation for video instance segmentation task.
We integrate audio features into our video segmentation model to conduct an audio-visual learning scheme.
arXiv Detail & Related papers (2023-01-04T01:33:42Z)
- Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
- VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)
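Several entries above (AQFormer, TransAVS) and CATR itself share the idea of conditioning decoder queries on audio so that the predicted masks follow the sound. The sketch below is a generic, hypothetical illustration of that mechanism; the module names and shapes are assumptions for exposition and are not taken from any of the three codebases:

# Generic sketch of audio-conditioned object queries for mask decoding.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class AudioQueriedDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 16, heads: int = 8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)            # object-level slots
        self.audio_to_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, C) flattened per-frame features; audio: (B, T_a, C)
        B = visual.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, Q, C)
        # Inject the audio constraint: queries attend to audio tokens first,
        # so the decoded masks are tied to what is currently sounding.
        q = q + self.audio_to_query(q, audio, audio)[0]
        # Standard query-to-pixel cross-attention.
        q = q + self.cross_attn(q, visual, visual)[0]
        # Dot-product mask prediction, one mask logit map per query: (B, Q, N)
        return torch.einsum("bqc,bnc->bqn", self.mask_embed(q), visual)


decoder = AudioQueriedDecoder()
masks = decoder(torch.randn(2, 196, 256), torch.randn(2, 5, 256))
print(masks.shape)  # torch.Size([2, 16, 196])

In query-based segmentation methods of this kind, such per-query masks are typically matched to ground-truth sounding objects during training; the sketch stops at mask prediction.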