Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition
- URL: http://arxiv.org/abs/2601.15406v1
- Date: Wed, 21 Jan 2026 19:17:21 GMT
- Title: Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition
- Authors: Hatef Otroshi Shahreza, Anjith George, Sébastien Marcel
- Abstract summary: Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions.
- Score: 45.12459792999638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images come from different sensing modalities, including visible (VIS), near-infrared (NIR), short-wave infrared (SWIR), and thermal imagery. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and several metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Despite recent advances in MLLMs, our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions. Our findings highlight the limitations of current MLLMs for HFR and the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
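As a concrete illustration of the verification metrics named in the abstract, the sketch below computes EER and TAR at a fixed false accept rate from genuine (mated) and impostor (non-mated) similarity scores. This is a minimal, generic implementation for illustration only, not the authors' evaluation code: the exact protocols, score types, and Acquire Rate handling used in the paper are not reproduced here, and all function and variable names are hypothetical.

```python
import numpy as np

def eer_and_tar(genuine, impostor, far_target=1e-3):
    """Compute EER and TAR@FAR from verification similarity scores.

    genuine  : 1-D array of scores for mated (same-identity) pairs
    impostor : 1-D array of scores for non-mated pairs
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Sweep candidate thresholds over the observed score range (ascending).
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accept rate
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false reject rate

    # EER: operating point where FAR and FRR are (approximately) equal.
    idx = np.argmin(np.abs(far - frr))
    eer = (far[idx] + frr[idx]) / 2.0

    # TAR@FAR: true accept rate at the lowest threshold whose FAR meets the target.
    ok = far <= far_target
    tar = 1.0 - frr[ok][0] if ok.any() else 0.0
    return eer, tar

# Toy usage with synthetic scores (higher = more similar).
rng = np.random.default_rng(0)
gen = rng.normal(0.7, 0.1, 1000)   # mated-pair scores
imp = rng.normal(0.3, 0.1, 10000)  # non-mated-pair scores
eer, tar = eer_and_tar(gen, imp, far_target=1e-3)
print(f"EER = {eer:.2%}, TAR@FAR=0.1% = {tar:.2%}")
```

In a cross-spectral protocol such as VIS-THERMAL, the genuine and impostor scores would come from comparing VIS enrollment images against probe images from the other modality; the metric computation itself is unchanged.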
Related papers
- Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond [116.65158801881984]
We introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. We develop a unified and interpretable FER foundation model termed UniFER-7B.
arXiv Detail & Related papers (2025-11-01T03:53:00Z) - Benchmarking Multimodal Large Language Models for Face Recognition [44.02544110500887]
Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. We present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets.
arXiv Detail & Related papers (2025-10-16T16:42:27Z) - A Vision Centric Remote Sensing Benchmark [21.48675282619887]
This study investigates the limitations of CLIP-based MLLMs in remote sensing tasks. We introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs in RS tasks by identifying the CLIP-blind pairs. We analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS specific representation learning.
arXiv Detail & Related papers (2025-03-20T03:03:46Z) - LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? [59.81732629438753]
We propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition by utilizing existing MLLM features. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), to take advantage of the characteristics of the MLLM decoder architecture. We also introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models.
arXiv Detail & Related papers (2025-03-10T16:05:40Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image [16.040813949620958]
We introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks.
arXiv Detail & Related papers (2024-11-25T09:00:36Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used for vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z) - From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models [36.41816380074965]
We investigate the effectiveness of different vision encoders within Multimodal Large Language Models (MLLMs).
Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding.
We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging.
arXiv Detail & Related papers (2023-10-13T02:41:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.