Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
- URL: http://arxiv.org/abs/2008.06581v1
- Date: Fri, 14 Aug 2020 21:50:26 GMT
- Title: Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
- Authors: Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan
- Abstract summary: The major challenge in the audio-visual event localization task lies in how to fuse information from multiple modalities effectively.
Recent works have shown that attention mechanisms are beneficial to the fusion process.
We propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization.
- Score: 25.883429290596556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The major challenge in the audio-visual event localization task lies in how to
fuse information from multiple modalities effectively. Recent works have shown
that attention mechanisms are beneficial to the fusion process. In this paper, we
propose a novel joint attention mechanism with multimodal fusion methods for
audio-visual event localization. In particular, we present a concise yet effective
architecture that learns representations from multiple modalities
in a joint manner. First, visual features are combined with auditory
features and turned into joint representations. Next, we use the
joint representations to attend to the visual features and auditory features,
respectively. With the help of this joint co-attention, new visual and auditory
features are produced, so that each modality benefits from the other. Notably,
the joint co-attention unit is recursive: it can be applied multiple times to
obtain progressively better joint representations. Extensive experiments on the public
AVE dataset show that the proposed method achieves significantly better
results than state-of-the-art methods.
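As an illustration, here is a minimal PyTorch sketch of the recursive joint co-attention idea described above: audio and visual features are fused into a joint representation, which then attends back to each modality, and the unit is applied repeatedly. The module names, dimensions, and exact attention form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class JointCoAttention(nn.Module):
    """Sketch of a joint co-attention unit: audio and visual features are
    fused into a joint representation, which then attends back to each
    modality. Dimensions and the attention form are assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # audio+visual -> joint representation
        self.q = nn.Linear(dim, dim)          # joint representation -> queries
        self.kv_a = nn.Linear(dim, 2 * dim)   # audio -> keys/values
        self.kv_v = nn.Linear(dim, 2 * dim)   # visual -> keys/values

    def attend(self, joint, kv_proj, feats):
        # Scaled dot-product attention with the joint representation as query.
        q = self.q(joint)                            # (B, T, D)
        k, v = kv_proj(feats).chunk(2, dim=-1)       # (B, T, D) each
        w = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return w @ v                                 # attended modality features

    def forward(self, audio, visual):
        joint = torch.tanh(self.fuse(torch.cat([audio, visual], dim=-1)))
        new_audio = self.attend(joint, self.kv_a, audio)
        new_visual = self.attend(joint, self.kv_v, visual)
        return new_audio, new_visual


def recursive_fusion(audio, visual, unit: JointCoAttention, steps: int = 2):
    """Apply the co-attention unit recursively to refine both modalities."""
    for _ in range(steps):
        audio, visual = unit(audio, visual)
    return audio, visual
```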
Related papers
- Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention [3.5803801804085347]
We introduce a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework.
We also explore BLSTMs to improve the temporal modeling of audio-visual feature representations.
Results indicate that the proposed model shows promising improvements in fusion performance by adeptly capturing the intra- and inter-modal relationships.
arXiv Detail & Related papers (2024-03-07T16:57:45Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling [24.346868432774453]
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment.
This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models.
We address training early fusion architectures by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion.
We propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions.
arXiv Detail & Related papers (2023-12-02T03:38:49Z)
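As a loose illustration of the masked-reconstruction idea in the entry above, the sketch below masks a fraction of concatenated audio and visual tokens, encodes the visible ones with a shared (early fusion) transformer, and reconstructs the full sequence. The encoder depth, masking ratio, linear decoder, and loss target are all assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyFusionMaskedAutoencoder(nn.Module):
    """Sketch: concatenate audio and visual tokens (early fusion), mask a
    random subset, encode the visible tokens jointly, and reconstruct the
    full sequence. Hyperparameters are illustrative assumptions."""

    def __init__(self, dim=256, depth=4, heads=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Linear(dim, dim)                   # stand-in decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, audio_tokens, visual_tokens):
        x = torch.cat([audio_tokens, visual_tokens], dim=1)  # (B, N, D), early fusion
        B, N, D = x.shape
        keep = max(1, int(N * (1 - self.mask_ratio)))
        # A random subset of token positions stays visible.
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :keep]
        visible = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        encoded = self.encoder(visible)                      # joint A-V encoding
        # Masked slots are filled with a learned mask token before decoding.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, D), encoded)
        recon = self.decoder(full)
        return F.mse_loss(recon, x)   # simplified: loss over all tokens
```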
- Audio-Visual Speaker Verification via Joint Cross-Attention [4.229744884478575]
We introduce cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification.
We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification.
arXiv Detail & Related papers (2023-09-28T16:25:29Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
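The entry above does not detail how the implicit manipulation query works; as a generic illustration of learnable queries aggregating global context within a modality, the sketch below uses DETR-style learnable query vectors that cross-attend over one modality's tokens. All names and shapes are assumptions, not the paper's definition of IMQ.

```python
import torch
import torch.nn as nn


class LearnableQueryPooling(nn.Module):
    """Generic sketch of query-based aggregation: learnable query vectors
    cross-attend over one modality's token sequence to collect global
    contextual cues. Loosely inspired by the IMQ idea; details here are
    assumptions."""

    def __init__(self, dim=256, num_queries=4, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                        # tokens: (B, N, D)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)      # queries attend to tokens
        return pooled                                 # (B, num_queries, D)
```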
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network, in which the main training parameters are devoted to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
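One plausible reading of "audio-guided cross-modal fusion" in the entry above is cross-attention in which audio features act as queries over the visual stream; a hedged sketch of such a layer follows. The residual structure, dimensions, and feed-forward block are assumptions, not the paper's CMFE.

```python
import torch
import torch.nn as nn


class AudioGuidedCrossModalLayer(nn.Module):
    """Sketch of an audio-guided cross-attention layer: audio features
    query the visual stream and the attended visual context is fused back
    into the audio pathway. Structure and dimensions are assumptions."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, visual):                 # both (B, T, D)
        ctx, _ = self.cross(audio, visual, visual)    # audio queries visual
        audio = self.norm1(audio + ctx)               # residual fusion
        return self.norm2(audio + self.ffn(audio))
```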
- Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention [15.643176705932396]
We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-09-19T15:01:55Z)
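As a sketch of the joint cross-attention idea above, where attention weights come from the correlation between a joint audio-visual representation and each modality, one could write the following; the projections, the tanh nonlinearity, and the residual connections are assumptions.

```python
import torch
import torch.nn as nn


class JointCrossAttention(nn.Module):
    """Sketch: attention weights are derived from the correlation between
    a joint audio-visual representation and each individual modality.
    Projections, tanh, and residual connections are assumptions."""

    def __init__(self, dim=256):
        super().__init__()
        self.joint = nn.Linear(2 * dim, dim)
        self.w_a = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, audio, visual):                 # (B, T, D) each
        j = self.joint(torch.cat([audio, visual], dim=-1))
        # Correlation of each modality with the joint representation.
        corr_a = torch.tanh(self.w_a(audio) @ j.transpose(-2, -1))  # (B, T, T)
        corr_v = torch.tanh(self.w_v(visual) @ j.transpose(-2, -1))
        att_a = torch.softmax(corr_a, dim=-1) @ j     # joint context for audio
        att_v = torch.softmax(corr_v, dim=-1) @ j
        return audio + att_a, visual + att_v          # residual-enhanced features
```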
- A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z)
- Audio-visual speech separation based on joint feature representation with cross-modal attention [45.210105822471256]
This study is inspired by learning joint feature representations from audio and visual streams with an attention mechanism.
To further improve audio-visual speech separation, the dense optical flow of lip motion is incorporated.
The overall improvement of the performance has demonstrated that the additional motion network effectively enhances the visual representation of the combined lip images and audio signal.
arXiv Detail & Related papers (2022-03-05T04:39:46Z)
- Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects attention differences among the views and adaptively integrates frame-level information so that the views benefit from each other.
Experiments on four action datasets show that the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)
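The CAM summary above leaves the mechanism unspecified; purely as a speculative sketch, one way to let views "benefit each other" is to blend each view's frame attention with a cross-view consensus before pooling. Everything below, including the shapes and the blending weight, is an assumption.

```python
import torch
import torch.nn as nn


class CollaborativeAttention(nn.Module):
    """Speculative sketch: score each frame in every view, form a
    cross-view consensus over the frame attention, and blend it with each
    view's own attention before pooling. All details are assumptions."""

    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, views):                         # views: (V, B, T, D)
        attn = torch.softmax(self.score(views).squeeze(-1), dim=-1)  # (V, B, T)
        consensus = attn.mean(dim=0, keepdim=True)    # average across views
        blended = 0.5 * attn + 0.5 * consensus        # each view borrows consensus
        return (views * blended.unsqueeze(-1)).sum(dim=2)  # (V, B, D) summaries
```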