Dynamic Cross Attention for Audio-Visual Person Verification
- URL: http://arxiv.org/abs/2403.04661v3
- Date: Mon, 22 Apr 2024 14:04:55 GMT
- Title: Dynamic Cross Attention for Audio-Visual Person Verification
- Authors: R. Gnana Praveen, Jahangir Alam
- Abstract summary: We propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly.
In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism.
Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model.
- Score: 3.5803801804085347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although person or identity verification has predominantly been explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to exhibit strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they do not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly based on the strong or weak complementary relationships, respectively, across audio and visual modalities. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose cross-attended features only when they exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the VoxCeleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves the performance of multiple variants of cross-attention while outperforming state-of-the-art methods.
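
The gating idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attend`, `gated_select`), the gate weight matrix `w_gate`, the softmax temperature, and the choice to compute the gate from the cross-attended features are all assumptions made for illustration. The sketch only shows the general pattern of a soft gate that mixes cross-attended and unattended features per time step.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats, context_feats):
    # Plain scaled dot-product cross-attention without learned projections
    # (a simplification; real models use learned weights).
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)    # (T, T)
    return softmax(scores, axis=-1) @ context_feats         # (T, d)

def gated_select(unattended, attended, w_gate, temperature=0.1):
    # Hypothetical gating layer: two logits per time step, one for the
    # unattended branch and one for the cross-attended branch.
    logits = attended @ w_gate                               # (T, 2)
    # A low temperature pushes the soft gate toward a near-hard choice.
    gate = softmax(logits / temperature, axis=-1)            # (T, 2)
    # Convex combination of the two branches, per time step.
    out = gate[:, :1] * unattended + gate[:, 1:] * attended
    return out, gate

rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 8))     # 5 time steps, 8-dim features (toy sizes)
visual = rng.standard_normal((5, 8))
w_gate = rng.standard_normal((8, 2))    # stands in for learned gate weights

audio_att = cross_attend(audio, visual)
fused_audio, gate = gated_select(audio, audio_att, w_gate)
print(fused_audio.shape, gate.round(3))
```

When the gate assigns most of its weight to the second column, the output follows the cross-attended features; when the first column dominates, the original (unattended) features pass through largely unchanged.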
 
      
Related papers
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
 We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
 arXiv  Detail & Related papers  (2025-04-06T13:59:16Z)
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
 We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
 arXiv  Detail & Related papers  (2025-04-01T13:01:30Z)
- United we stand, Divided we fall: Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition in Valence-Arousal Space [3.1856756516735922]
 We introduce Gated Recursive Joint Cross Attention (GRJCA) using a gating mechanism that can adaptively choose the most relevant features.
The proposed approach improves the performance of the RJCA model by adding more flexibility to deal with weak complementary relationships.
 arXiv  Detail & Related papers  (2025-03-15T21:03:20Z)
- Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition [3.1967132086545127]
 Leveraging complementary relationships across modalities has recently drawn a lot of attention in multimodal emotion recognition.
We propose Inconsistency-Aware Cross-Attention (IACA), which can adaptively select the most relevant features on-the-fly.
Experiments are conducted on the challenging Aff-Wild2 dataset to show the robustness of the proposed model.
 arXiv  Detail & Related papers  (2024-05-21T15:11:35Z)
- Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition [3.5803801804085347]
 We propose Dynamic Cross-Attention (DCA) that can dynamically select cross-attended or unattended features on the fly.
We evaluate the performance of the proposed approach on the challenging RECOLA and Aff-Wild2 datasets.
 arXiv  Detail & Related papers  (2024-03-28T16:38:04Z)
- Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention [3.5803801804085347]
 We introduce a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework.
We also explore BLSTMs to improve the temporal modeling of audio-visual feature representations.
Results indicate that the proposed model shows promising improvement in fusion performance by adeptly capturing the intra- and inter-modal relationships.
 arXiv  Detail & Related papers  (2024-03-07T16:57:45Z)
- Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
 This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
 arXiv  Detail & Related papers  (2023-05-25T15:26:13Z)
- Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention [15.643176705932396]
 We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on correlation between the joint feature representation and that of the individual modalities.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
 arXiv  Detail & Related papers  (2022-09-19T15:01:55Z)
- Trusted Multi-View Classification with Dynamic Evidential Fusion [73.35990456162745]
 We propose a novel multi-view classification algorithm, termed trusted multi-view classification (TMC).
TMC provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
Both theoretical and experimental results validate the effectiveness of the proposed model in accuracy, robustness and trustworthiness.
 arXiv  Detail & Related papers  (2022-04-25T03:48:49Z)
- A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
 We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
 arXiv  Detail & Related papers  (2022-03-28T14:09:43Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
 We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
 arXiv  Detail & Related papers  (2022-01-26T18:04:29Z)
- MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
 We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame makes it possible to approximate an instantaneous audio-visual assignment problem.
 arXiv  Detail & Related papers  (2021-01-11T02:57:25Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
 Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
 arXiv  Detail & Related papers  (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.