A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition
- URL: http://arxiv.org/abs/2305.19458v1
- Date: Tue, 30 May 2023 23:53:12 GMT
- Title: A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition
- Authors: Shentong Mo, Pedro Morgado
- Abstract summary: We propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition.
OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives.
Experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks.
- Score: 26.828874753756523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to accurately recognize, localize and separate sound sources is
fundamental to any audio-visual perception task. Historically, these abilities
were tackled separately, with several methods developed independently for each
task. However, given the interconnected nature of source localization,
separation, and recognition, independent models are likely to yield suboptimal
performance as they fail to capture the interdependence between these tasks. To
address this problem, we propose a unified audio-visual learning framework
(dubbed OneAVM) that integrates audio and visual cues for joint localization,
separation, and recognition. OneAVM comprises a shared audio-visual encoder and
task-specific decoders trained with three objectives. The first objective
aligns audio and visual representations through a localized audio-visual
correspondence loss. The second tackles visual source separation using a
traditional mix-and-separate framework. Finally, the third objective reinforces
visual feature separation and localization by mixing images in pixel space and
aligning their representations with those of all corresponding sound sources.
Extensive experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound
datasets demonstrate the effectiveness of OneAVM for all three tasks,
audio-visual source localization, separation, and nearest neighbor recognition,
and empirically demonstrate a strong positive transfer between them.
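To make the three training objectives concrete, here is a minimal PyTorch sketch of how they could be instantiated. The module names (`separator`, `encoder`), tensor shapes, the pixel-space mixing by simple averaging, and the InfoNCE-style contrastive formulation are illustrative assumptions based on the abstract, not the paper's released implementation.
```python
# Minimal sketch (not the authors' code) of OneAVM's three objectives:
# (1) a localized audio-visual correspondence loss, (2) mix-and-separate
# separation, and (3) pixel-space image mixing aligned to both sound sources.
import torch
import torch.nn.functional as F


def localized_correspondence_loss(visual_feats, audio_feats, tau=0.07):
    """Objective 1: align each audio clip with localized visual features.

    visual_feats: (B, D, H, W) spatial features from the shared encoder.
    audio_feats:  (B, D) clip-level audio embeddings.
    """
    B = visual_feats.size(0)
    v = F.normalize(visual_feats.flatten(2), dim=1)   # (B, D, H*W)
    a = F.normalize(audio_feats, dim=1)               # (B, D)
    # Similarity of every audio clip to every spatial location of every image.
    sim = torch.einsum("bd,cdn->bcn", a, v)           # (B, B, H*W)
    # Max-pool over locations so the best-matching region drives the score,
    # which ties the contrastive alignment to localization.
    logits = sim.max(dim=-1).values / tau             # (B, B)
    targets = torch.arange(B, device=logits.device)
    return F.cross_entropy(logits, targets)


def mix_and_separate_loss(separator, spec_a, spec_b, img_a, img_b):
    """Objective 2: classic mix-and-separate. `separator` (assumed callable)
    predicts a spectrogram mask conditioned on the query image; supervision
    comes from the known unmixed spectrograms."""
    mix = spec_a + spec_b
    pred_a = separator(mix, img_a) * mix
    pred_b = separator(mix, img_b) * mix
    return F.l1_loss(pred_a, spec_a) + F.l1_loss(pred_b, spec_b)


def image_mixing_loss(encoder, img_a, img_b, audio_a, audio_b, tau=0.07):
    """Objective 3: mix two images in pixel space and require the mixed
    image's embedding to align with the audio of *both* sources."""
    mixed = 0.5 * (img_a + img_b)                     # naive pixel mix (assumption)
    v = F.normalize(encoder(mixed), dim=1)            # (B, D) global visual embedding
    targets = torch.arange(v.size(0), device=v.device)
    loss = 0.0
    for audio in (audio_a, audio_b):
        a = F.normalize(audio, dim=1)
        loss = loss + F.cross_entropy(v @ a.t() / tau, targets)
    return loss
```
In training, the three terms would presumably be summed (possibly with weights) and back-propagated through the shared audio-visual encoder, which is what allows the positive transfer between localization, separation, and recognition reported above.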
Related papers
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation [36.38300120482868]
We present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation.
ASMP achieves a clear improvement in source separation quality, outperforming prior works on two challenging audio-visual datasets.
arXiv Detail & Related papers (2022-10-29T02:55:39Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation, where the network learns both what individual objects look like and what they sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)