Improving Face Recognition from Caption Supervision with Multi-Granular
Contextual Feature Aggregation
- URL: http://arxiv.org/abs/2308.06866v1
- Date: Sun, 13 Aug 2023 23:52:15 GMT
- Title: Improving Face Recognition from Caption Supervision with Multi-Granular
Contextual Feature Aggregation
- Authors: Md Mahedi Hasan and Nasser Nasrabadi
- Abstract summary: We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition systems.
We implement the proposed CGFR framework on two face recognition models (ArcFace and AdaFace) and evaluated its performance on the Multi-Modal CelebA-HQ dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce caption-guided face recognition (CGFR) as a new framework to
improve the performance of commercial-off-the-shelf (COTS) face recognition
(FR) systems. In contrast to combining soft biometrics (eg., facial marks,
gender, and age) with face images, in this work, we use facial descriptions
provided by face examiners as a piece of auxiliary information. However, due to
the heterogeneity of the modalities, improving the performance by directly
fusing the textual and facial features is very challenging, as both lie in
different embedding spaces. In this paper, we propose a contextual feature
aggregation module (CFAM) that addresses this issue by effectively exploiting
the fine-grained word-region interaction and global image-caption association.
Specifically, CFAM adopts a self-attention and a cross-attention scheme for
improving the intra-modality and inter-modality relationship between the image
and textual features, respectively. Additionally, we design a textual feature
refinement module (TFRM) that refines the textual features of the pre-trained
BERT encoder by updating the contextual embeddings. This module enhances the
discriminative power of textual features with a cross-modal projection loss and
realigns the word and caption embeddings with visual features by incorporating
a visual-semantic alignment loss. We implemented the proposed CGFR framework on
two face recognition models (ArcFace and AdaFace) and evaluated its performance
on the Multi-Modal CelebA-HQ dataset. Our framework significantly improves the
performance of ArcFace in both 1:1 verification and 1:N identification
protocol.
Related papers
- FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (Fine CLIPER)
Our Fine CLIPER achieves tunable SOTA performance on the DFEW, FERV39k, and MAFW datasets with few parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z) - Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding [26.768147543628096]
We propose a novel framework that emphasizes object and context comprehension inspired by human cognitive processes.
Our method achieves significant performance improvements on three benchmark datasets.
arXiv Detail & Related papers (2024-04-12T16:38:48Z) - Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z) - Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition [56.968108142307976]
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM)
Our approach introduces canonical class-aware glyph masks to suppress background and text style noise.
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
arXiv Detail & Related papers (2024-02-21T09:22:45Z) - Text-Guided Face Recognition using Multi-Granularity Cross-Modal
Contrastive Learning [0.0]
We introduce text-guided face recognition (TGFR) to analyze the impact of integrating facial attributes in the form of natural language descriptions.
TGFR demonstrates remarkable improvements, particularly on low-quality images, over existing face recognition models.
arXiv Detail & Related papers (2023-12-14T22:04:22Z) - Synchronizing Vision and Language: Bidirectional Token-Masking
AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE)
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency [126.88107868670767]
We propose multi-textbfView textbfConsistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - More comprehensive facial inversion for more effective expression
recognition [8.102564078640274]
We propose a novel generative method based on the image inversion mechanism for the FER task, termed Inversion FER (IFER)
ASIT is equipped with an image inversion discriminator that measures the cosine similarity of semantic features between source and generated images, constrained by a distribution alignment loss.
We extensively evaluate ASIT on facial datasets such as FFHQ and CelebA-HQ, showing that our approach achieves state-of-the-art facial inversion performance.
arXiv Detail & Related papers (2022-11-24T12:31:46Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation [89.49412325699537]
Referring Image (RIS) aims at segmenting the target object from an image referred by one given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net)
Through the two-stage enhancement, our proposed TV-Net enjoys better performances in learning fine-grained matching behaviors between the natural language expression and image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.