Fusion and Orthogonal Projection for Improved Face-Voice Association
- URL: http://arxiv.org/abs/2112.10483v1
- Date: Mon, 20 Dec 2021 12:33:33 GMT
- Title: Fusion and Orthogonal Projection for Improved Face-Voice Association
- Authors: Muhammad Saad Saeed, Muhammad Haris Khan, Shah Nawaz, Muhammad Haroon
Yousaf, Alessio Del Bue
- Abstract summary: We study the problem of learning the association between face and voice.
We propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings.
- Score: 15.938463726577128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of learning the association between face and voice,
which has lately gained interest in the computer vision community. Prior works
adopt pairwise or triplet loss formulations to learn an embedding space amenable
to the associated matching and verification tasks. Albeit showing some progress,
such loss formulations are restrictive due to their dependency on a
distance-dependent margin parameter, poor run-time training complexity, and
reliance on carefully crafted negative mining procedures. In this work, we
hypothesize that an enriched feature representation coupled with effective yet
efficient supervision is necessary to realize a discriminative joint embedding
space for improved face-voice association. To this end, we propose a
light-weight, plug-and-play mechanism that exploits the complementary cues in
both modalities to form enriched fused embeddings and clusters them based on
their identity labels via orthogonality constraints. We coin our proposed
mechanism fusion and orthogonal projection (FOP) and instantiate it in a
two-stream pipeline. The overall resulting framework is evaluated on the
large-scale VoxCeleb dataset with a multitude of tasks, including cross-modal
verification and matching. Results show that our method performs favourably
against the current state-of-the-art methods and that our proposed supervision
formulation is more effective and efficient than the ones employed by
contemporary methods.
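The abstract does not give the exact fusion layer or loss; the following is a minimal sketch of one plausible instantiation, assuming a simple per-dimension gated fusion of the two modality embeddings and an orthogonality-based loss that pulls same-identity embeddings together (cosine similarity toward 1) while pushing different identities toward orthogonality (cosine similarity toward 0). Both the gate and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_embeddings(face, voice):
    """Hypothetical gated fusion of face and voice embeddings (rows = samples).

    The sigmoid gate weighs each dimension before a convex combination of the
    two modalities; this is an illustrative choice, not the paper's exact layer.
    """
    gate = 1.0 / (1.0 + np.exp(-(face + voice)))   # per-dimension sigmoid gate
    fused = gate * face + (1.0 - gate) * voice     # convex combination
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)  # L2-normalize

def orthogonal_projection_loss(embeddings, labels):
    """Encourage same-identity pairs to align and cross-identity pairs to be
    orthogonal; the loss is 0 when clusters are tight and mutually orthogonal."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = z @ z.T                                  # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]      # same-identity mask
    off_diag = ~np.eye(len(labels), dtype=bool)    # drop self-similarities
    intra = cos[same & off_diag].mean()            # mean same-identity cosine
    inter = np.abs(cos[~same]).mean()              # mean cross-identity |cosine|
    return (1.0 - intra) + inter
```

A batch with two tight, mutually orthogonal identity clusters yields a loss of 0; in the paper this kind of term supplements a standard classification loss, avoiding the margin parameters and negative mining of pairwise/triplet formulations.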
Related papers
- Understanding Human Activity with Uncertainty Measure for Novelty in Graph Convolutional Networks [2.223052975765005]
We introduce the Temporal Fusion Graph Convolutional Network.
It aims to rectify the inadequate boundary estimation of individual actions within an activity stream.
It also mitigates the issue of over-segmentation in the temporal dimension.
arXiv Detail & Related papers (2024-10-10T13:44:18Z)
- Interactive Graph Convolutional Filtering [79.34979767405979]
Interactive Recommender Systems (IRS) have been increasingly used in various domains, including personalized article recommendation, social media, and online advertising.
The challenges these systems face are exacerbated by the cold-start and data-sparsity problems.
Existing Multi-Armed Bandit methods, despite their carefully designed exploration strategies, often struggle to provide satisfactory results in the early stages.
Our proposed method extends interactive collaborative filtering into the graph model to enhance the performance of collaborative filtering between users and items.
arXiv Detail & Related papers (2023-09-04T09:02:31Z)
- Unsupervised Visible-Infrared Person ReID by Collaborative Learning with Neighbor-Guided Label Refinement [53.044703127757295]
Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims at learning modality-invariant features from an unlabeled cross-modality dataset.
We propose a Dual Optimal Transport Label Assignment (DOTLA) framework to simultaneously assign the generated labels from one modality to its counterpart modality.
The proposed DOTLA mechanism formulates a mutually reinforcing and efficient solution to cross-modality data association, which can effectively reduce the side-effects of insufficient and noisy label associations.
arXiv Detail & Related papers (2023-05-22T04:40:30Z)
- Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID [56.573905143954015]
We propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters.
Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at the cluster level.
Experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:27:46Z)
- Batch Active Learning from the Perspective of Sparse Approximation [12.51958241746014]
Active learning enables efficient model training by leveraging interactions between machine learning agents and human annotators.
We study and propose a novel framework that formulates batch active learning from the perspective of sparse approximation.
Our active learning method aims to find an informative subset from the unlabeled data pool such that the corresponding training loss function approximates its full data pool counterpart.
arXiv Detail & Related papers (2022-11-01T03:20:28Z)
- Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for this is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
- Learning Branched Fusion and Orthogonal Projection for Face-Voice Association [20.973188176888865]
We propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings.
Results reveal that our method performs favourably against the current state-of-the-art methods.
In addition, we leverage cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association.
arXiv Detail & Related papers (2022-08-22T12:23:09Z)
- Real-time landmark detection for precise endoscopic submucosal dissection via shape-aware relation network [51.44506007844284]
We propose a shape-aware relation network for accurate and real-time landmark detection in endoscopic submucosal dissection surgery.
We first devise an algorithm to automatically generate relation keypoint heatmaps, which intuitively represent the prior knowledge of spatial relations among landmarks.
We then develop two complementary regularization schemes to progressively incorporate the prior knowledge into the training process.
arXiv Detail & Related papers (2021-11-08T07:57:30Z)
- Scalable Bayesian Inverse Reinforcement Learning [93.27920030279586]
We introduce Approximate Variational Reward Imitation Learning (AVRIL).
Our method addresses the ill-posed nature of the inverse reinforcement learning problem.
Applying our method to real medical data alongside classic control simulations, we demonstrate Bayesian reward inference in environments beyond the scope of current methods.
arXiv Detail & Related papers (2021-02-12T12:32:02Z)
- Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning [20.748083855677816]
This paper tackles the problem of human action recognition, defined as classifying which action is displayed in a trimmed sequence, from skeletal data.
We propose a novel subspace clustering method, which exploits covariance matrices to enhance the discriminability of actions, and a timestamp pruning approach that allows us to better handle the temporal dimension of the data.
arXiv Detail & Related papers (2020-06-21T14:44:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.