Open-Vocabulary Audio-Visual Semantic Segmentation
- URL: http://arxiv.org/abs/2407.21721v1
- Date: Wed, 31 Jul 2024 16:14:09 GMT
- Title: Open-Vocabulary Audio-Visual Semantic Segmentation
- Authors: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
- Abstract summary: We introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space.
We propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module that performs audio-visual fusion and locates all potential sounding objects, and 2) an open-vocabulary classification module that predicts categories with the help of prior knowledge from large-scale pre-trained vision-language models.
- Score: 12.330521982730836
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate under the closed-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen or heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module that performs audio-visual fusion and locates all potential sounding objects, and 2) an open-vocabulary classification module that predicts categories with the help of prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
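To make the second module concrete, the sketch below shows one way an open-vocabulary classification step could score mask-level features against text embeddings of category names, CLIP-style. The function name, embedding size, class count, and temperature are illustrative assumptions, not the released OV-AVSS code.

```python
# Hypothetical sketch of an open-vocabulary classification step: mask-level
# visual embeddings are scored against frozen text-encoder embeddings of
# category names, so categories never seen during training can still be named.
import torch
import torch.nn.functional as F

def classify_masks(mask_embeds: torch.Tensor,
                   text_embeds: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """mask_embeds: (num_masks, dim) pooled features of localized sounding objects
       text_embeds: (num_classes, dim) embeddings of class names (base + novel)
       returns:     (num_masks, num_classes) classification probabilities"""
    mask_embeds = F.normalize(mask_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = mask_embeds @ text_embeds.t() / temperature
    return logits.softmax(dim=-1)

# Usage: 5 mask proposals scored against 71 category names in a 512-d space.
probs = classify_masks(torch.randn(5, 512), torch.randn(71, 512))
print(probs.shape)  # torch.Size([5, 71])
```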
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event Localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation [7.124066540020968]
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS) pursues semantic understanding of audio-visual scenes.
Previous methods have struggled to handle this combination of objectives in end-to-end training, resulting in insufficient learning and sub-optimal performance.
We propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simple subtasks, from localization to semantic understanding, each fully optimized in its own stage to achieve step-by-step global optimization.
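A minimal sketch of what such a two-stage schedule could look like, assuming a separable localizer and classifier; the module interfaces, losses, and hyperparameters are illustrative, not the authors' implementation.

```python
# Stage 1 optimizes audio-visual localization (binary masks) only; stage 2
# freezes the localizer and optimizes semantic classification on its masks.
import torch

def train_stepping_stones(localizer, classifier, loader, epochs_per_stage=10):
    # Stage 1: learn where the sounding objects are.
    opt = torch.optim.AdamW(localizer.parameters(), lr=1e-4)
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs_per_stage):
        for audio, frames, binary_mask, _ in loader:
            loss = bce(localizer(audio, frames), binary_mask)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze localization, learn what each localized object is.
    for p in localizer.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(classifier.parameters(), lr=1e-4)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs_per_stage):
        for audio, frames, _, sem_label in loader:
            with torch.no_grad():
                masks = localizer(audio, frames)
            loss = ce(classifier(frames, masks), sem_label)
            opt.zero_grad()
            loss.backward()
            opt.step()
```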
arXiv Detail & Related papers (2024-07-16T15:08:30Z)
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
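A hedged sketch of a dual-encoder alignment objective in this spirit: dense visual and audio features form a pairwise similarity volume whose strongest local matches drive a clip-level contrastive loss. The shapes and the max-pooling aggregation are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def av_contrastive_loss(vis_feats, aud_feats, temperature=0.07):
    """vis_feats: (B, D, HW) dense visual features per clip
       aud_feats: (B, D, T)  audio features per clip"""
    vis = F.normalize(vis_feats, dim=1)
    aud = F.normalize(aud_feats, dim=1)
    # Similarity volume between every clip pair: (B, B, HW, T).
    sim = torch.einsum('idp,jdt->ijpt', vis, aud)
    # Aggregate the best local match into a clip-level score: (B, B).
    clip_sim = sim.flatten(2).max(dim=-1).values / temperature
    # Symmetric InfoNCE: matched audio-visual clips sit on the diagonal.
    targets = torch.arange(vis.size(0))
    return 0.5 * (F.cross_entropy(clip_sim, targets) +
                  F.cross_entropy(clip_sim.t(), targets))
```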
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations or model training.
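One plausible annotation-free pipeline, sketched under the assumption that off-the-shelf foundation models are simply chained: an audio tagger names the sounding category and a text-driven segmenter masks it in the frame. Both model handles below are placeholders, not the paper's exact components.

```python
def unsupervised_avs(frame, waveform, audio_tagger, open_vocab_segmenter):
    # 1. Identify what is sounding (e.g. "dog", "guitar") from audio alone.
    category = audio_tagger(waveform)
    # 2. Prompt a text-driven segmenter with the predicted category name.
    mask = open_vocab_segmenter(frame, prompt=category)
    return category, mask
```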
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units and is used to train the model to learn semantically rich speech representations.
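The sketch below illustrates those two ingredients as described: a 1-D convolutional encoder over raw audio, and a k-means step that assigns frames to a small set of phoneme-like units usable as self-supervised targets. Layer sizes and the cluster count are assumptions.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """1-D convolutional layers over windowed raw audio samples."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )

    def forward(self, wav):           # wav: (B, 1, num_samples)
        return self.net(wav)          # (B, dim, num_frames)

def cluster_units(feats, num_units=100, iters=20):
    """Vanilla k-means over frame embeddings -> discrete unit ids."""
    x = feats.detach().permute(0, 2, 1).reshape(-1, feats.size(1))
    centers = x[torch.randperm(x.size(0))[:num_units]]
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=1)
        for k in range(num_units):
            pts = x[assign == k]
            if len(pts):
                centers[k] = pts.mean(dim=0)
    return assign  # phoneme-like pseudo-label for each frame

# Usage: one second of 16 kHz audio per clip, 100 candidate units.
units = cluster_units(ConvEncoder()(torch.randn(2, 1, 16000)))
```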
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
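A rough sketch of that idea, assuming a text-to-image sampler and a dense feature extractor are available as callables: sample a few images for a category, pool their features into a class prototype, and threshold the prototype's similarity to any frame's dense features. This is illustrative, not the actual OVDiff implementation.

```python
import torch
import torch.nn.functional as F

def build_prototype(category, diffusion_sample, feature_extractor, n=8):
    """Average pooled features of n synthesized images into one prototype."""
    feats = []
    for _ in range(n):
        img = diffusion_sample(prompt=f"a photo of a {category}")
        feats.append(feature_extractor(img).mean(dim=(-2, -1)))  # (D,)
    return F.normalize(torch.stack(feats).mean(dim=0), dim=-1)

def segment(frame_feats, prototype, threshold=0.5):
    """frame_feats: (D, H, W) dense features; returns a (H, W) binary mask."""
    sim = torch.einsum('dhw,d->hw', F.normalize(frame_feats, dim=0), prototype)
    return sim > threshold
```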
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
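A minimal sketch of the text diversification idea, assuming a frozen text encoder: embed several synonyms per training category and average them, so the model is not anchored to a single category name. The synonym table below is a made-up example.

```python
import torch
import torch.nn.functional as F

# Hypothetical synonym sets for two training categories.
SYNONYMS = {
    "dog": ["dog", "puppy", "canine"],
    "couch": ["couch", "sofa", "settee"],
}

def diversified_text_embed(category, encode_text):
    """encode_text: callable mapping a string to a (D,) embedding."""
    embeds = torch.stack([encode_text(s) for s in SYNONYMS[category]])
    return F.normalize(embeds.mean(dim=0), dim=-1)
```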
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy [26.975596225131824]
We propose a joint training framework that uses acoustic features and raw images directly as inputs for the AVSC task.
Specifically, we retrieve the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and a 1D-CNN-based acoustic encoder during training.
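An illustrative sketch of that setup in PyTorch: the bottom layers of a pre-trained ResNet serve as the visual encoder, a 1D-CNN encodes acoustic features, and a shared scene classifier is trained jointly. The backbone choice, layer cut, and sizes are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torchvision

class AVSceneClassifier(nn.Module):
    def __init__(self, num_scenes=10, dim=128):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        # Keep only the bottom layers of the pre-trained image model.
        self.visual = nn.Sequential(*list(resnet.children())[:6])
        self.acoustic = nn.Sequential(    # 1D-CNN over e.g. 64 mel bands
            nn.Conv1d(64, dim, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128 + dim, num_scenes)

    def forward(self, image, audio_feats):
        v = self.visual(image).mean(dim=(-2, -1))   # (B, 128) pooled visual
        a = self.acoustic(audio_feats).squeeze(-1)  # (B, dim) pooled acoustic
        return self.head(torch.cat([v, a], dim=1))  # scene logits
```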
arXiv Detail & Related papers (2022-04-25T03:37:02Z)