Related papers: Dynamic Cross Attention for Audio-Visual Person Verification


URL: http://arxiv.org/abs/2403.04661v3
Date: Mon, 22 Apr 2024 14:04:55 GMT
Title: Dynamic Cross Attention for Audio-Visual Person Verification
Authors: R. Gnana Praveen, Jahangir Alam
Abstract summary: We propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism. Extensive experiments are conducted on the Voxceleb1 dataset to demonstrate the robustness of the proposed model.
Score: 3.5803801804085347
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although person or identity verification has been predominantly explored using individual modalities such as face and voice, audio-visual fusion has recently shown immense potential to outperform unimodal approaches. Audio and visual modalities are often expected to pose strong complementary relationships, which play a crucial role in effective audio-visual fusion. However, they may not always strongly complement each other; they may also exhibit weak complementary relationships, resulting in poor audio-visual feature representations. In this paper, we propose a Dynamic Cross-Attention (DCA) model that can dynamically select the cross-attended or unattended features on the fly based on the strong or weak complementary relationships, respectively, across audio and visual modalities. In particular, a conditional gating layer is designed to evaluate the contribution of the cross-attention mechanism and choose cross-attended features only when they exhibit strong complementary relationships, and unattended features otherwise. Extensive experiments are conducted on the Voxceleb1 dataset to demonstrate the robustness of the proposed model. Results indicate that the proposed model consistently improves the performance on multiple variants of cross-attention while outperforming the state-of-the-art methods.
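
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a two-way softmax gate with a temperature, computed from the cross-attended features, and the names `gated_select`, `gate_weights`, and `temperature` are hypothetical. With a low temperature the gate approaches a hard choice between the unattended and the cross-attended features for each time step.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def gated_select(unattended, attended, gate_weights, temperature=0.1):
    """Blend unattended and cross-attended features with a learned gate.

    unattended, attended: arrays of shape (T, d)
    gate_weights: array of shape (d, 2), an assumed learnable projection
    Returns the gated features (T, d) and the gate probabilities (T, 2).
    """
    # Two logits per time step: column 0 -> unattended, column 1 -> attended.
    logits = attended @ gate_weights
    # A small temperature sharpens the softmax toward a near-hard selection.
    gate = softmax(logits / temperature, axis=-1)
    mixed = gate[:, :1] * unattended + gate[:, 1:] * attended
    return mixed, gate


# Toy usage with random features (illustrative values only).
rng = np.random.default_rng(0)
T, d = 4, 8
unattended = rng.normal(size=(T, d))
attended = rng.normal(size=(T, d))
gate_weights = rng.normal(size=(d, 2))
mixed, gate = gated_select(unattended, attended, gate_weights)
```

In this sketch the gate is soft, so it stays differentiable; whether the paper's gate is computed exactly this way is an assumption here.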

Related papers

Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection [54.10252086842123]
Multimodal Sentiment Analysis (MSA) aims to predict sentiment from language, acoustic, and visual data in videos. This paper proposes a modality optimization and dynamic primary modality selection framework (MODS). Experiments on four benchmark datasets demonstrate that MODS outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-11-09T11:13:32Z)
Semantic Item Graph Enhancement for Multimodal Recommendation [49.66272783945571]
Multimodal recommendation systems have attracted increasing attention for their improved performance by leveraging items' multimodal information. Prior methods often build modality-specific item-item semantic graphs from raw modality features. These semantic graphs suffer from semantic deficiencies, including insufficient modeling of collaborative signals among items.
arXiv Detail & Related papers (2025-08-08T09:20:50Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
United we stand, Divided we fall: Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition in Valence-Arousal Space [3.1856756516735922]
We introduce Gated Recursive Joint Cross Attention (GRJCA) using a gating mechanism that can adaptively choose the most relevant features. The proposed approach improves the performance of RJCA model by adding more flexibility to deal with weak complementary relationships.
arXiv Detail & Related papers (2025-03-15T21:03:20Z)
Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition [3.1967132086545127]
Leveraging complementary relationships across modalities has recently drawn a lot of attention in multimodal emotion recognition. We propose Inconsistency-Aware Cross-Attention (IACA), which can adaptively select the most relevant features on-the-fly. Experiments are conducted on the challenging Aff-Wild2 dataset to show the robustness of the proposed model.
arXiv Detail & Related papers (2024-05-21T15:11:35Z)
Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition [3.5803801804085347]
We propose Dynamic Cross-Attention (DCA) that can dynamically select cross-attended or unattended features on the fly. We evaluate the performance of the proposed approach on the challenging RECOLA and Aff-Wild2 datasets.
arXiv Detail & Related papers (2024-03-28T16:38:04Z)
Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention [3.5803801804085347]
We introduce a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework. We also explore BLSTMs to improve the temporal modeling of audio-visual feature representations. Results indicate that the proposed model shows promising improvement in fusion performance by adeptly capturing the intra- and inter-modal relationships.
arXiv Detail & Related papers (2024-03-07T16:57:45Z)
Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image. We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention [15.643176705932396]
We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities. It computes the cross-attention weights based on correlation between the joint feature representation and that of the individual modalities. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-09-19T15:01:55Z)
Trusted Multi-View Classification with Dynamic Evidential Fusion [73.35990456162745]
We propose a novel multi-view classification algorithm, termed trusted multi-view classification (TMC). TMC provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. Both theoretical and experimental results validate the effectiveness of the proposed model in accuracy, robustness and trustworthiness.
arXiv Detail & Related papers (2022-04-25T03:48:49Z)
A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features. Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z)
Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition. We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem. Our experiments show that a small graph data structure built from a single frame makes it possible to approximate an instantaneous audio-visual assignment problem.
arXiv Detail & Related papers (2021-01-11T02:57:25Z)
Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images. We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.