Benchmarking Foundation Models for Zero-Shot Biometric Tasks
- URL: http://arxiv.org/abs/2505.24214v1
- Date: Fri, 30 May 2025 04:53:55 GMT
- Title: Benchmarking Foundation Models for Zero-Shot Biometric Tasks
- Authors: Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross
- Abstract summary: Foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), have redefined the frontiers of artificial intelligence. In this work, we introduce a benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success.
- Score: 8.276938046158163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Faces in the Wild (LFW) dataset, without any fine-tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD-R-Full dataset was 97.55 percent without any fine-tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long-term vision of Artificial General Intelligence.
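The zero-shot verification protocol described in the abstract (frozen VLM image embeddings scored by similarity, reported as TMR at a fixed FMR) can be sketched as follows. This is an illustrative sketch, not the authors' released code: the CLIP checkpoint, the embed and tmr_at_fmr helpers, and the pair lists are placeholder choices standing in for whichever of the 41 benchmarked VLMs and whichever evaluation protocol the paper actually uses.

```python
# Illustrative sketch only. Assumptions: the model choice, helper names, and
# pair lists are placeholders; the paper evaluates 41 public VLMs/MLLMs.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # one representative public VLM
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def embed(image_paths):
    """Return L2-normalized image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

def cosine_scores(emb_a, emb_b):
    """Cosine similarity for row-aligned pairs of normalized embeddings."""
    return np.sum(emb_a * emb_b, axis=1)

def tmr_at_fmr(genuine_scores, impostor_scores, fmr=0.01):
    """True Match Rate at the decision threshold that yields the requested FMR."""
    threshold = np.quantile(impostor_scores, 1.0 - fmr)  # accept scores >= threshold
    return float(np.mean(genuine_scores >= threshold))

# Usage sketch: genuine_pairs / impostor_pairs would come from the LFW protocol.
# genuine = cosine_scores(embed([a for a, _ in genuine_pairs]),
#                         embed([b for _, b in genuine_pairs]))
# impostor = cosine_scores(embed([a for a, _ in impostor_pairs]),
#                          embed([b for _, b in impostor_pairs]))
# print(f"TMR @ 1% FMR: {tmr_at_fmr(genuine, impostor):.4f}")
```

The "simple classifier head" mentioned for deepfake detection, iris PAD, and soft-biometric attribute prediction could likewise be a lightweight classifier (for example, logistic regression) fit on the same frozen embeddings; the exact head is our assumption rather than a detail stated in the abstract.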
Related papers
- ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models [82.04858317800097]
We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts. ForenX employs powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues. We introduce ForgReason, a dataset dedicated to descriptions of forgery evidence in AI-generated images.
arXiv Detail & Related papers (2025-08-02T15:21:26Z)
- Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait [70.00430652562012]
FarSight is an end-to-end system for person recognition that integrates biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion.
arXiv Detail & Related papers (2025-05-07T17:58:25Z)
- Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
- Representation Learning and Identity Adversarial Training for Facial Behavior Understanding [3.350769246260559]
We revisit two fundamental factors in AU detection: diverse and large-scale data, and subject identity regularization. Pretraining a masked autoencoder on Face9M yields strong performance in AU detection and facial expression tasks. Our proposed methods, Facial Masked Autoencoder (FMAE) and IAT, are simple, generic and effective.
arXiv Detail & Related papers (2024-07-15T21:13:28Z)
- CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection [3.849401956130233]
We explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection.
We employ only a single dataset (ProGAN) in order to adapt CLIP for deepfake detection.
The simple and lightweight Prompt Tuning based adaptation strategy outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy.
arXiv Detail & Related papers (2024-02-20T11:26:42Z)
- SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [61.8876114116716]
Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks. However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored. We introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection.
arXiv Detail & Related papers (2024-02-06T17:31:36Z)
- Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models [0.09264362806173355]
Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries.
This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts.
Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
arXiv Detail & Related papers (2023-12-30T03:19:54Z)
- SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation [60.94239810407917]
This paper presents a multi-purpose algorithm for simultaneous face recognition, facial expression recognition, age estimation, and face attribute estimation based on a single Swin Transformer.
To address the conflicts among multiple tasks, a Multi-Level Channel Attention (MLCA) module is integrated into each task-specific analysis.
Experiments show that the proposed model has a better understanding of the face and achieves excellent performance for all tasks.
arXiv Detail & Related papers (2023-08-22T15:38:39Z)
- Facial Soft Biometrics for Recognition in the Wild: Recent Works, Annotation, and COTS Evaluation [63.05890836038913]
We study the role of soft biometrics to enhance person recognition systems in unconstrained scenarios.
We consider two assumptions: 1) manual estimation of soft biometrics and 2) automatic estimation from two commercial off-the-shelf systems.
Experiments are carried out fusing soft biometrics with two state-of-the-art face recognition systems based on deep learning.
arXiv Detail & Related papers (2022-10-24T11:29:57Z)
- Unsupervised Learning Facial Parameter Regressor for Action Unit Intensity Estimation via Differentiable Renderer [51.926868759681014]
We present a framework to predict the facial parameters based on a bone-driven face model (BDFM) under different views.
The proposed framework consists of a feature extractor, a generator, and a facial parameter regressor.
arXiv Detail & Related papers (2020-08-20T09:49:13Z)