Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
- URL: http://arxiv.org/abs/2309.07929v3
- Date: Fri, 2 Feb 2024 08:02:35 GMT
- Title: Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
- Authors: Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, Xi Li
- Abstract summary: We introduce an encoder-prompt-decoder paradigm as an alternative to decoding localization information from a fused audio-visual feature.
Specifically, we first propose to construct a Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects.
We also develop a Correlation Adapter (ColA) that keeps the training effort minimal while retaining the knowledge of the visual foundation model.
- Score: 22.846623384472377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Never having seen an object and heard its sound simultaneously, can the model
still accurately localize its visual position from the input audio? In this
work, we concentrate on the Audio-Visual Localization and Segmentation tasks
but under the demanding zero-shot and few-shot scenarios. To achieve this goal,
different from existing approaches that mostly employ the
encoder-fusion-decoder paradigm to decode localization information from the
fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm,
aiming to better handle data scarcity and shifting data distributions by
leveraging the abundant knowledge of pre-trained models. Specifically, we first
propose to construct a Semantic-aware Audio Prompt (SAP) that helps the visual
foundation model focus on sounding objects while also shrinking the semantic gap
between the visual and audio modalities. Then, we develop a Correlation Adapter
(ColA) that keeps the training effort minimal while retaining adequate knowledge
of the visual foundation model. Equipped with these designs, extensive
experiments demonstrate that this new paradigm outperforms fusion-based methods
in both the unseen-class and
cross-dataset settings. We hope that our work can further promote the
generalization study of Audio-Visual Localization and Segmentation in practical
application scenarios.
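For intuition only, below is a minimal PyTorch sketch of how an audio-derived prompt and a lightweight adapter could be attached to a frozen visual backbone. The module names SemanticAwareAudioPrompt and CorrelationAdapter mirror the terms in the abstract, but every dimension, the projection design, and the correlation step are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the encoder-prompt-decoder idea described above.
# All module names, dimensions, and design choices are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticAwareAudioPrompt(nn.Module):
    """Projects a pooled audio embedding into prompt tokens for a frozen visual model (assumed design)."""
    def __init__(self, audio_dim=128, prompt_dim=256, num_tokens=4):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, prompt_dim * num_tokens), nn.GELU())
        self.num_tokens, self.prompt_dim = num_tokens, prompt_dim

    def forward(self, audio_emb):                       # (B, audio_dim)
        prompts = self.proj(audio_emb)                  # (B, prompt_dim * num_tokens)
        return prompts.view(-1, self.num_tokens, self.prompt_dim)

class CorrelationAdapter(nn.Module):
    """Lightweight bottleneck adapter that correlates visual tokens with audio prompts (assumed design)."""
    def __init__(self, dim=256, bottleneck=64):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), nn.GELU()

    def forward(self, visual_tokens, audio_prompts):    # (B, N, dim), (B, T, dim)
        # Attend each visual token to the audio prompts, then apply a residual bottleneck update.
        attn = torch.softmax(visual_tokens @ audio_prompts.transpose(1, 2), dim=-1)  # (B, N, T)
        correlated = attn @ audio_prompts               # (B, N, dim)
        return visual_tokens + self.up(self.act(self.down(correlated)))

if __name__ == "__main__":
    B, N = 2, 196                                       # batch size, number of visual patch tokens
    sap, cola = SemanticAwareAudioPrompt(), CorrelationAdapter()
    audio_emb = torch.randn(B, 128)                     # e.g. a pooled audio feature
    visual_tokens = torch.randn(B, N, 256)              # output of a frozen visual encoder
    out = cola(visual_tokens, sap(audio_emb))
    print(out.shape)                                    # torch.Size([2, 196, 256])

The point of the sketch is that only the prompt and adapter parameters would be trained while the visual foundation model stays frozen, which is how the abstract motivates keeping the training effort minimal under zero-shot and few-shot conditions.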
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings (a minimal sketch of this projection step appears after this list).
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the film industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense
Interactions through Masked Modeling [24.346868432774453]
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment.
This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models.
We train audio-visual encoders with early fusion by leveraging the masked reconstruction framework, which has previously been successful in unimodal settings.
We also propose an attention-based fusion module that operates on local audio and visual representations, enhancing the model's ability to capture fine-grained interactions.
arXiv Detail & Related papers (2023-12-02T03:38:49Z) - Estimating Visual Information From Audio Through Manifold Learning [14.113590443352495]
We propose a new framework for extracting visual information about a scene only using audio signals.
Our framework is based on Manifold Learning and consists of two steps.
We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset.
arXiv Detail & Related papers (2022-08-03T20:47:11Z) - Contrastive Learning of Global and Local Audio-Visual Representations [25.557229705149577]
We propose a versatile self-supervised approach for learning audio-visual representations that generalize to tasks requiring global semantic information.
We show that our approach learns generalizable video representations on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.
arXiv Detail & Related papers (2021-04-07T07:35:08Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised
Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervision to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z) - Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
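As referenced in the LEAP entry above, the following is a minimal PyTorch sketch of projecting segment features onto label embeddings. The shapes, the cosine-similarity projection, and the agreement objective are illustrative assumptions, not the method's actual formulation.

# Minimal sketch of projecting audio/visual segment features onto label embeddings.
# Shapes, the similarity measure, and the loss below are illustrative assumptions.
import torch
import torch.nn.functional as F

def project_onto_labels(segment_feats, label_embs):
    """Return the similarity of each segment feature to each label embedding.

    segment_feats: (B, S, D) latent features of audio or visual segments
    label_embs:    (C, D)    one embedding per semantic label
    returns:       (B, S, C) cosine similarities
    """
    seg = F.normalize(segment_feats, dim=-1)
    lab = F.normalize(label_embs, dim=-1)
    return seg @ lab.t()

if __name__ == "__main__":
    B, S, D, C = 2, 10, 256, 25                # batch, segments, feature dim, label count
    audio_feats = torch.randn(B, S, D)
    visual_feats = torch.randn(B, S, D)
    labels = torch.randn(C, D)
    a_sim = project_onto_labels(audio_feats, labels)
    v_sim = project_onto_labels(visual_feats, labels)
    # A semantic-similarity-style loss could then encourage the audio and visual
    # projections of the same segment to agree (hypothetical objective).
    loss = F.mse_loss(a_sim, v_sim)
    print(a_sim.shape, loss.item())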