Related papers: CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

URL: http://arxiv.org/abs/2309.09709v2
Date: Wed, 20 Sep 2023 17:55:55 GMT
Title: CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Authors: Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao
Abstract summary: Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames. We propose a decoupled audio-video dependence combining audio and video features from their respective temporal and spatial dimensions.
Score: 43.562848631392384
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block, which, when stacked, enables capturing audio-visual fine-grained combinatorial-dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring the decoded mask adheres to the sounds. Experimental results confirm our approach's effectiveness, with our framework achieving a new SOTA performance on all three datasets using two backbones. The code is available at https://github.com/aspirinone/CATR.github.io

Related papers

ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation [55.76423101183408]
ViSAudio is an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture. It generates high-quality audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.
arXiv Detail & Related papers (2025-12-02T18:56:12Z)
Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval [58.640807985155554]
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to a given query. Most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. We propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR.
arXiv Detail & Related papers (2025-08-06T09:58:43Z)
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. Our framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset.
arXiv Detail & Related papers (2025-06-30T08:40:36Z)
Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation [39.38821481268827]
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects that accurately align with the corresponding audio. Current methods focus more on object-level information but neglect the boundaries of audio semantic changes, leading to temporal misalignment. We propose a Collaborative Hybrid Propagator Framework (Co-Prop) to address this issue.
arXiv Detail & Related papers (2024-12-11T07:33:18Z)
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis [28.172213291270868]
Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A) presents inherent challenges related to audio-visual synchronization. We construct a controllable video-to-audio model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals.
arXiv Detail & Related papers (2024-09-10T01:07:20Z)
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation [36.50512269898893]
To distinguish the sounding objects from silent ones, audio-visual semantic correspondence and temporal interaction are required. We propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information. Our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting.
arXiv Detail & Related papers (2023-09-18T05:58:06Z)
Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz. We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
TransAVS: End-To-End Audio-Visual Segmentation With Transformer [33.56539999875508]
We propose TransAVS, the first Transformer-based end-to-end framework for the Audio-Visual Segmentation task. TransAVS disentangles the audio stream as audio queries, which will interact with images and decode into segmentation masks. Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset.
arXiv Detail & Related papers (2023-05-12T03:31:04Z)
Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
Object Segmentation with Audio Context [0.5243460995467893]
This project explores the multimodal feature aggregation for video instance segmentation task. We integrate audio features into our video segmentation model to conduct an audio-visual learning scheme.
arXiv Detail & Related papers (2023-01-04T01:33:42Z)
Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS). The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z)
VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media. We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes. The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.