Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications
- URL: http://arxiv.org/abs/2509.14921v1
- Date: Thu, 18 Sep 2025 12:58:18 GMT
- Title: Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications
- Authors: Tahar Chettaoui, Naser Damer, Fadi Boutros,
- Abstract summary: Foundation models such as CLIP have demonstrated exceptional zero- and few-shot transfer capabilities across diverse vision tasks.<n>However, fine-tuned for highly specialized biometric tasks, face recognition (FR), morphing attack detection (MAD), and presentation attack detection (PAD) may suffer from over-specialization.<n>We systematically quantify these trade-offs by evaluating three instances of CLIP fine-tuned for FR, MAD, and PAD.
- Score: 18.08946802592489
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foundation models such as CLIP have demonstrated exceptional zero- and few-shot transfer capabilities across diverse vision tasks. However, when fine-tuned for highly specialized biometric tasks, face recognition (FR), morphing attack detection (MAD), and presentation attack detection (PAD), these models may suffer from over-specialization. Thus, they may lose one of their foundational strengths, cross-domain generalization. In this work, we systematically quantify these trade-offs by evaluating three instances of CLIP fine-tuned for FR, MAD, and PAD. We evaluate each adapted model as well as the original CLIP baseline on 14 general vision datasets under zero-shot and linear-probe protocols, alongside common FR, MAD, and PAD benchmarks. Our results indicate that fine-tuned models suffer from over-specialization, especially when fine-tuned for complex tasks of FR. Also, our results pointed out that task complexity and classification head design, multi-class (FR) vs. binary (MAD and PAD), correlate with the degree of catastrophic forgetting. The FRoundation model with the ViT-L backbone outperforms other approaches on the large-scale FR benchmark IJB-C, achieving an improvement of up to 58.52%. However, it experiences a substantial performance drop on ImageNetV2, reaching only 51.63% compared to 69.84% achieved by the baseline CLIP model. Moreover, the larger CLIP architecture consistently preserves more of the model's original generalization ability than the smaller variant, indicating that increased model capacity may help mitigate over-specialization.
Related papers
- Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta [0.0]
classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges.<n>We propose a robust framework that integrates the hybrid CoAtNet architecture with model soups.<n>Our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines.
arXiv Detail & Related papers (2026-03-02T18:50:15Z) - Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms.<n>We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets.<n>By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z) - Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners [1.2461503242570642]
This paper presents a study on adapting the Contrastive Captioners (CoCa) visual backbone for few-shot image classification.<n>We identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning.<n>We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy.
arXiv Detail & Related papers (2025-12-14T20:13:21Z) - Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation [61.248535801314375]
Subset-Selected Counterfactual Augmentation (SS-CA)<n>We develop Counterfactual LIMA to identify minimal spatial region sets whose removal can selectively alter model predictions.<n>Experiments show that SS-CA improves generalization on in-distribution (ID) test data and achieves superior performance on out-of-distribution (OOD) benchmarks.
arXiv Detail & Related papers (2025-11-15T08:39:22Z) - Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable.<n>We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment.<n>Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z) - Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks [0.7988085110283119]
Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs only marginally outperform random baselines.<n>We provide the model with the capability to structure its thinking by guiding the model to build a structured knowledge graph.<n> Experiments on the test subset of the Corr2Cause dataset benchmark with Qwen3-32B model (reasoning model) show substantial gains over standard direct prompting methods.
arXiv Detail & Related papers (2025-05-23T15:37:40Z) - Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z) - A Simple and Generalist Approach for Panoptic Segmentation [57.94892855772925]
We propose a simple generalist framework based on a deep encoder - shallow decoder architecture with per-pixel prediction.<n>We show that this is due to imbalance during training and propose a novel method for reducing it.<n>Our method achieves panoptic quality (PQ) of 55.1 on the challenging MS-COCO dataset.
arXiv Detail & Related papers (2024-08-29T13:02:12Z) - Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations [19.800907485589402]
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks.
These tuned models tend to become highly specialized, limiting their practicality for real-world deployment.
We propose a lightweight representation calibration method for fine-tuning CLIP.
arXiv Detail & Related papers (2024-03-12T01:47:17Z) - Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness [52.9493817508055]
We propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) to enhance the model's zero-shot adversarial robustness.
Our approach consistently improves clean accuracy by an average of 8.72%.
arXiv Detail & Related papers (2024-01-09T04:33:03Z) - A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z) - Mitigating the Alignment Tax of RLHF [76.4300447532456]
aligning LLMs under Reinforcement Learning with Human Feedback can lead to forgetting pretrained abilities, also known as the alignment tax.
We propose model averaging to maximize alignment performance while incurring minimal alignment tax.
We validate HMA's performance across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B.
arXiv Detail & Related papers (2023-09-12T14:16:54Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Mitigating Spurious Correlations in Multi-modal Models during
Fine-tuning [18.45898471459533]
Spurious correlations that degrade model generalization or lead the model to be right for the wrong reasons are one of the main robustness concerns for real-world deployments.
This paper proposes a novel approach to address spurious correlations during fine-tuning for a given domain of interest.
arXiv Detail & Related papers (2023-04-08T05:20:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.