Learning Branched Fusion and Orthogonal Projection for Face-Voice
Association
- URL: http://arxiv.org/abs/2208.10238v1
- Date: Mon, 22 Aug 2022 12:23:09 GMT
- Title: Learning Branched Fusion and Orthogonal Projection for Face-Voice
Association
- Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Haris Khan, Sajid Javed,
Muhammad Haroon Yousaf, Alessio Del Bue
- Abstract summary: We propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings.
Results reveal that our method performs favourably against the current state-of-the-art methods.
In addition, we leverage cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association.
- Score: 20.973188176888865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen an increased interest in establishing the
association between faces and voices of celebrities by leveraging audio-visual
information from YouTube. Prior works adopt metric learning methods to learn an embedding
space that is amenable to the associated matching and verification tasks. Although
showing some progress, such formulations are restrictive due to their
dependency on a distance-dependent margin parameter, poor run-time training
complexity, and reliance on carefully crafted negative mining procedures. In
this work, we hypothesize that an enriched representation coupled with
effective yet efficient supervision is important for realizing a
discriminative joint embedding space for face-voice association tasks. To this
end, we propose a light-weight, plug-and-play mechanism that exploits the
complementary cues in both modalities to form enriched fused embeddings and
clusters them based on their identity labels via orthogonality constraints. We
coin our proposed mechanism as fusion and orthogonal projection (FOP) and
instantiate it in a two-stream network. The resulting framework is
evaluated on VoxCeleb1 and MAV-Celeb datasets with a multitude of tasks,
including cross-modal verification and matching. Results reveal that our method
performs favourably against the current state-of-the-art methods and our
proposed formulation of supervision is more effective and efficient than
those employed by contemporary methods. In addition, we leverage cross-modal
verification and matching tasks to analyze the impact of multiple languages on
face-voice association. Code is available at: https://github.com/msaadsaeed/FOP
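The abstract describes the FOP mechanism only at a high level. As a rough sketch, a gated fusion of the two modality embeddings combined with an orthogonality-based identity loss might look like the following; the gating scheme, the exact loss form, and all dimensions are assumptions for illustration, not the paper's implementation.
```python
# Minimal sketch of a fusion + orthogonal-projection objective for
# face-voice embeddings; the details below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Fuse a face embedding and a voice embedding into one enriched vector."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, face: torch.Tensor, voice: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([face, voice], dim=-1)))
        return g * face + (1.0 - g) * voice  # complementary-cue mixing

def orthogonal_projection_loss(emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pull same-identity embeddings together (cosine -> 1) and push
    different-identity embeddings toward orthogonality (cosine -> 0)."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t()  # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos = sim[same & ~eye]  # same identity, distinct samples
    neg = sim[~same]        # different identities
    return (1.0 - pos.mean()) + neg.mean().abs()

# Usage with random stand-in embeddings:
faces, voices = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 3, (8,))
fused = GatedFusion(256)(faces, voices)
loss = orthogonal_projection_loss(fused, labels)
```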
Related papers
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
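Reading the summary above loosely, the target branch's margin can be made adaptive to how well the anchor branch already separates a pair. The toy loss below is an assumption about that mechanism, not DBL itself; all names and the margin rule are hypothetical.
```python
# Toy adaptive-margin loss inspired by the DBL summary above; the rule for
# deriving the margin from the anchor branch is an assumption.
import torch
import torch.nn.functional as F

def adaptive_margin_loss(anchor_gap: torch.Tensor,
                         target_pos: torch.Tensor,
                         target_neg: torch.Tensor,
                         base_margin: float = 0.2) -> torch.Tensor:
    # anchor_gap: matched-minus-unmatched score gap from the trained anchor branch
    # target_pos / target_neg: matched and unmatched scores from the target branch
    margin = base_margin + anchor_gap.detach().clamp(min=0.0)  # beat the anchor by a margin
    return F.relu(margin - (target_pos - target_neg)).mean()
```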
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- DiffVein: A Unified Diffusion Network for Finger Vein Segmentation and Authentication [50.017055360261665]
We introduce DiffVein, a unified diffusion model-based framework which simultaneously addresses vein segmentation and authentication tasks.
For better feature interaction between these two branches, we introduce two specialized modules.
In this way, our framework allows for a dynamic interplay between diffusion and segmentation embeddings.
arXiv Detail & Related papers (2024-02-03T06:49:42Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from a different angle, namely the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
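The summary names two noise families but not their exact form; one plausible instantiation (token deletion as discrete noise, Gaussian perturbation as continuous noise) is sketched below as an assumption, not DenoSent's actual design.
```python
# Illustrative noise functions for a denoising objective; the specific
# choices here are assumptions, not the DenoSent implementation.
import random
import torch

def discrete_noise(tokens: list, drop_p: float = 0.15) -> list:
    # Randomly delete tokens; the model is trained to restore the original.
    kept = [t for t in tokens if random.random() > drop_p]
    return kept if kept else tokens  # never return an empty sentence

def continuous_noise(embeddings: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Perturb token embeddings with Gaussian noise.
    return embeddings + sigma * torch.randn_like(embeddings)
```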
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Multi-scale Target-Aware Framework for Constrained Image Splicing Detection and Localization [11.803255600587308]
We propose a multi-scale target-aware framework to couple feature extraction and correlation matching in a unified pipeline.
Our approach effectively promotes collaborative learning of related patches, and lets feature learning and correlation matching reinforce each other.
Our experiments demonstrate that our model, which uses a unified pipeline, outperforms state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2023-08-18T07:38:30Z)
- Unsupervised Visible-Infrared Person ReID by Collaborative Learning with Neighbor-Guided Label Refinement [53.044703127757295]
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims at learning modality-invariant features from an unlabeled cross-modality dataset.
We propose a Dual Optimal Transport Label Assignment (DOTLA) framework to simultaneously assign the generated labels from one modality to its counterpart modality.
The proposed DOTLA mechanism formulates a mutually reinforcing and efficient solution to cross-modality data association, which can effectively reduce the side-effects of insufficient and noisy label associations.
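The summary points to an optimal-transport formulation for assigning labels across modalities. A generic entropic (Sinkhorn-style) assignment of the kind such methods typically build on is sketched below; the cost matrix, entropy weight, and uniform marginals are assumptions, not the DOTLA specifics.
```python
# Generic Sinkhorn optimal-transport assignment; a stand-in for the
# cross-modality label assignment described above, not DOTLA itself.
import torch

def sinkhorn_assign(cost: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    # cost[i, j]: disagreement between sample i in one modality and label j
    # generated in the counterpart modality; returns a soft assignment plan.
    K = torch.exp(-cost / eps)
    u = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform row marginal
    v = torch.full((cost.size(1),), 1.0 / cost.size(1))  # uniform column marginal
    r, c = u.clone(), v.clone()
    for _ in range(iters):
        r = u / (K @ c)
        c = v / (K.t() @ r)
    return r.unsqueeze(1) * K * c.unsqueeze(0)  # rows/columns match u and v
```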
arXiv Detail & Related papers (2023-05-22T04:40:30Z)
- Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention [15.643176705932396]
We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on correlation between the joint feature representation and that of the individual modalities.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
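Per the description above, the attention weights come from correlating a joint A-V representation with each modality's own features. The module below is a minimal sketch of that idea; the single linear projection and the dimensions are assumptions, not the paper's code.
```python
# Minimal sketch of correlation-based joint cross-attention over
# audio-visual features; details are illustrative assumptions.
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # joint A-V features -> shared space

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio, video: (batch, time, dim) feature sequences
        joint = self.proj(torch.cat([audio, video], dim=-1))           # (B, T, D)
        attn_a = torch.softmax(joint @ audio.transpose(1, 2), dim=-1)  # joint-audio correlation
        attn_v = torch.softmax(joint @ video.transpose(1, 2), dim=-1)  # joint-video correlation
        return attn_a @ audio, attn_v @ video  # re-weighted salient features
```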
arXiv Detail & Related papers (2022-09-19T15:01:55Z)
- A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-28T14:09:43Z)
- Fusion and Orthogonal Projection for Improved Face-Voice Association [15.938463726577128]
We study the problem of learning association between face and voice.
We propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings.
arXiv Detail & Related papers (2021-12-20T12:33:33Z)
- Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition [13.994609732846344]
The most effective techniques for emotion recognition efficiently leverage diverse and complementary sources of information.
We introduce a cross-attentional fusion approach to extract the salient features across audio-visual (A-V) modalities.
Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-11-09T16:01:56Z)
- Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks [61.950353376870154]
Joint-event-extraction is a sequence-to-sequence labeling task with a tag set composed of tags of triggers and entities.
We propose a Cross-Supervised Mechanism (CSM) to alternately supervise the extraction of triggers or entities.
Our approach outperforms the state-of-the-art methods in both entity and trigger extraction.
arXiv Detail & Related papers (2020-10-13T11:51:17Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)