PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association
- URL: http://arxiv.org/abs/2505.17002v2
- Date: Wed, 28 May 2025 11:34:57 GMT
- Title: PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association
- Authors: Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, Muhammad Irzam Liaqat, Markus Schedl, Mubashir Noman
- Abstract summary: We study the task of learning the association between faces and voices. We propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion.
- Score: 9.21950270306253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the task of learning the association between faces and voices, which has lately been gaining interest in the multimodal community. Existing methods suffer from the deliberate crafting of negative mining procedures as well as from reliance on a distance margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and must be aligned before fusion. To this end, we propose a method that accurately aligns the two embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
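The abstract names two ingredients: aligning the face and voice embedding spaces before fusion, and a gated fusion whose fused output is trained under orthogonality constraints. The following is a minimal PyTorch sketch of that general recipe; the module structure, dimensions, gating form, and loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Minimal gated feature fusion for face/voice embeddings (illustrative)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Project both modalities into a shared space before fusing (alignment step).
        self.face_proj = nn.Linear(dim, dim)
        self.voice_proj = nn.Linear(dim, dim)
        # The gate decides, per dimension, how much each modality contributes.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, face: torch.Tensor, voice: torch.Tensor) -> torch.Tensor:
        f = F.normalize(self.face_proj(face), dim=-1)
        v = F.normalize(self.voice_proj(voice), dim=-1)
        g = torch.sigmoid(self.gate(torch.cat([f, v], dim=-1)))
        return g * f + (1.0 - g) * v  # fused face-voice embedding

def orthogonality_loss(fused: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pull same-identity pairs together and push different-identity pairs
    toward orthogonality (cosine ~ 0). Assumes the batch contains at least
    one same-identity pair."""
    z = F.normalize(fused, dim=-1)
    cos = z @ z.t()  # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos = (1.0 - cos[same & ~eye]).mean()  # same identity: cosine -> 1
    neg = cos[~same].abs().mean()          # different identity: cosine -> 0
    return pos + neg
```

Usage would be along the lines of `fused = GatedFusion()(face_emb, voice_emb)` followed by `loss = orthogonality_loss(fused, identity_labels)`; note that such an objective needs no negative mining schedule or margin hyperparameter, which is the point the abstract makes.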
Related papers
- Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding.
Due to the lack of semantics, heterogeneous representations may lead to erroneous matches.
We introduce the multi-granularity implicit text (MIT), spanning the video, segment, and frame levels, as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z)
- Be Decisive: Noise-Induced Layouts for Multi-Subject Generation [56.80513553424086]
Complex prompts lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features.
We introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process.
Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step.
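As a rough illustration of predicting a layout from the evolving noise, here is a hypothetical sketch: a tiny convolutional network maps the current noisy latent to per-subject spatial masks, which a sampler could re-predict and refine at every denoising step. The network, shapes, and names are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LayoutPredictor(nn.Module):
    """Tiny conv net mapping a noisy diffusion latent to per-subject layout
    masks (an illustrative stand-in for the paper's small layout network)."""
    def __init__(self, latent_ch: int = 4, num_subjects: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch, 32, kernel_size=3, padding=1), nn.SiLU(),
            nn.Conv2d(32, num_subjects, kernel_size=3, padding=1),
        )

    def forward(self, noisy_latent: torch.Tensor) -> torch.Tensor:
        # Softmax over the subject channel: each spatial location is softly
        # assigned to one subject, yielding a layout that can steer attention.
        return self.net(noisy_latent).softmax(dim=1)

# In a sampling loop, the layout would be re-predicted from the latent at
# every denoising step and used to confine each subject's attention region.
```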
arXiv Detail & Related papers (2025-05-27T17:54:24Z)
- Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning [16.515613048905674]
Reconstruction and joint embedding have emerged as two leading paradigms in Self-Supervised Learning (SSL).
Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them.
arXiv Detail & Related papers (2025-05-18T15:54:55Z)
- Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning [14.403812623299027]
Retrieval-Augmented Generation (RAG) offers an effective solution to the hallucination and knowledge-obsolescence issues faced by Large Language Models (LLMs).
We propose Parenting, a novel framework that decouples, identifies, and purposefully optimizes parameter subspaces related to adherence and robustness.
arXiv Detail & Related papers (2024-10-14T10:26:57Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP [22.076206386214565]
Contrastive Language-Image Pre-training (CLIP) has yielded remarkable improvements in zero-shot classification and cross-modal vision-language tasks.
From a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap.
We show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings.
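The modality gap referenced here is commonly quantified as the distance between the centroids of the L2-normalized image and text embeddings. A short sketch of that measurement follows; the function name and signature are ours, not from the paper.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Euclidean distance between the modality centroids on the unit sphere,
    a common way to quantify the CLIP modality gap (illustrative helper)."""
    img_center = F.normalize(image_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(text_emb, dim=-1).mean(dim=0)
    return torch.linalg.norm(img_center - txt_center).item()
```

A smaller value after an intervention such as AlignCLIP would indicate better cross-modal alignment of the two embedding distributions.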
arXiv Detail & Related papers (2024-06-25T15:24:02Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches sentence representation learning from a different angle, namely the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
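A minimal sketch of the corruption step described above, combining discrete noise (random token masking) with continuous noise (Gaussian perturbation of embeddings); the names, rates, and exact recipe are illustrative assumptions rather than the paper's procedure.

```python
import torch

def corrupt_sentence(token_ids: torch.Tensor, embeddings: torch.Tensor,
                     mask_id: int, p_mask: float = 0.15, sigma: float = 0.1):
    """Apply discrete noise (token masking) and continuous noise (Gaussian
    perturbation); a denoising model is then trained to restore the original
    sentence from these corrupted inputs. Illustrative only."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < p_mask
    noisy_ids = token_ids.masked_fill(mask, mask_id)               # discrete noise
    noisy_emb = embeddings + sigma * torch.randn_like(embeddings)  # continuous noise
    return noisy_ids, noisy_emb
```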
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Learning Branched Fusion and Orthogonal Projection for Face-Voice Association [20.973188176888865]
We propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings.
Results reveal that our method performs favourably against the current state-of-the-art methods.
In addition, we leverage cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association.
arXiv Detail & Related papers (2022-08-22T12:23:09Z)
- A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition [46.443866373546726]
We focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos.
We propose a joint cross-attention model that relies on the complementary relationships to extract the salient features.
Our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
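A minimal PyTorch sketch of joint audio-visual cross-attention in this spirit, where each modality attends to the other and the attended features are combined; dimensions, pooling, and module names are assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    """Illustrative A-V cross-attention: audio queries visual features and
    vice versa, exploiting the complementary relationship between them."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, dim) feature sequences from each encoder
        a_att, _ = self.a2v(query=audio, key=visual, value=visual)
        v_att, _ = self.v2a(query=visual, key=audio, value=audio)
        # Temporal average pooling, then concatenation for a downstream
        # valence/arousal regression head.
        return torch.cat([a_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)
```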
arXiv Detail & Related papers (2022-03-28T14:09:43Z)
- Fusion and Orthogonal Projection for Improved Face-Voice Association [15.938463726577128]
We study the problem of learning the association between faces and voices.
We propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings.
arXiv Detail & Related papers (2021-12-20T12:33:33Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO aligns similarity scores by considering the discrepancy between an image and its neighbors.
IDA-SSE provides convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
- Joint Disentangling and Adaptation for Cross-Domain Person Re-Identification [88.79480792084995]
We propose a joint learning framework that disentangles id-related/unrelated features and enforces adaptation to work on the id-related feature space exclusively.
Our model involves a disentangling module that encodes cross-domain images into a shared appearance space and two separate structure spaces, and an adaptation module that performs adversarial alignment and self-training on the shared appearance space.
arXiv Detail & Related papers (2020-07-20T17:57:02Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the deep clustering (DC) network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
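A rough two-stage sketch consistent with this description: a first network produces per-frame deep embedding features for denoising, and a second network, standing in for the K-means clustering it replaces, estimates the anechoic speech. Every layer choice and dimension here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class TwoStageEnhancer(nn.Module):
    """Illustrative joint denoising + dereverberation pipeline built on
    deep embedding features (layer choices are assumptions, not the paper's)."""
    def __init__(self, n_freq: int = 257, emb_dim: int = 20, hidden: int = 256):
        super().__init__()
        self.denoise = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.embed = nn.Linear(hidden, n_freq * emb_dim)  # per-T-F-bin embeddings
        self.derev = nn.LSTM(n_freq * emb_dim, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)             # anechoic-speech mask

    def forward(self, noisy_spec: torch.Tensor) -> torch.Tensor:
        # noisy_spec: (batch, time, n_freq) magnitude spectrogram
        h, _ = self.denoise(noisy_spec)
        emb = self.embed(h)                    # noise-free deep embedding features
        h2, _ = self.derev(emb)                # second stage replaces K-means
        return torch.sigmoid(self.mask(h2)) * noisy_spec  # estimated anechoic speech
```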
arXiv Detail & Related papers (2020-04-06T06:34:01Z)