Multi-scale Multi-instance Visual Sound Localization and Segmentation
- URL: http://arxiv.org/abs/2409.00486v1
- Date: Sat, 31 Aug 2024 15:43:22 GMT
- Title: Multi-scale Multi-instance Visual Sound Localization and Segmentation
- Authors: Shentong Mo, Haofan Wang,
- Abstract summary: We propose a novel multi-scale visual sound localization framework, namely M2VSL.
M2VSL learns multi-scale semantic features associated with sound sources from the input image to localize sounding objects.
We conduct extensive experiments on VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks.
- Score: 15.624453757710802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omitted multi-scale visual features of the corresponding image, and they cannot learn discriminative regions compared to ground truths. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We also introduce a novel multi-scale multi-instance transformer to dynamically aggregate multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that the proposed M2VSL can achieve state-of-the-art performance on sounding object localization and segmentation.
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z) - QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition [47.103732403296654]
Multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces.
We introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones.
Experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance.
arXiv Detail & Related papers (2023-09-29T20:48:44Z) - A Unified Audio-Visual Learning Framework for Localization, Separation,
and Recognition [26.828874753756523]
We propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition.
OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives.
Experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks.
arXiv Detail & Related papers (2023-05-30T23:53:12Z) - VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and
Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performances on series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z) - Audio-Visual Grouping Network for Sound Localization from Mixtures [30.756247389435803]
Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.
We propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and image.
Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources.
arXiv Detail & Related papers (2023-03-29T22:58:55Z) - Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale
Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z) - Self-Supervised Predictive Learning: A Negative-Free Method for Sound
Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z) - Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source
Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - Multiple Sound Sources Localization from Coarse to Fine [41.56420350529494]
How to visually localize multiple sound sources in unconstrained videos is a formidable problem.
We develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes.
Our model achieves state-of-the-art results on public dataset of localization.
arXiv Detail & Related papers (2020-07-13T12:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.