Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum
- URL: http://arxiv.org/abs/2511.19846v1
- Date: Tue, 25 Nov 2025 02:23:10 GMT
- Title: Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum
- Authors: Thomas M Metz, Matthew Q Hill, Alice J O'Toole
- Abstract summary: Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space. We introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks.
- Score: 0.764671395172401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space, without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 principal components (PCs) calculated from any one task sufficed to perform all other tasks with nearly zero performance degradation.
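The abstract's description of IMIC is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of a gradient-coupled, interleaved fine-tuning step over the four identity domains; the class and domain names, the per-domain linear heads, and the summed-loss coupling are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of gradient-coupled, interleaved multi-domain
# fine-tuning. Assumes a PyTorch vision backbone that maps images to a
# fixed-size embedding; everything else here is an illustrative guess.
import itertools
import torch
import torch.nn as nn

class UnifiedEmbeddingModel(nn.Module):
    def __init__(self, backbone, embed_dim, ids_per_domain):
        super().__init__()
        self.backbone = backbone  # e.g. a DINOv3, CLIP, or EVA-02 encoder
        # One classification head per domain; all four heads read from the
        # same embedding space produced by the shared backbone.
        self.heads = nn.ModuleDict({
            d: nn.Linear(embed_dim, n) for d, n in ids_per_domain.items()
        })

    def forward(self, images, domain):
        emb = self.backbone(images)          # unified embedding
        return self.heads[domain](emb), emb

def interleaved_step(model, optimizer, batches, criterion):
    """One coupled update: per-domain losses are summed before a single
    optimizer step, so every update mixes gradients from all four tasks
    instead of alternating between them."""
    optimizer.zero_grad()
    total = sum(criterion(model(images, domain)[0], labels)
                for domain, (images, labels) in batches.items())
    total.backward()
    optimizer.step()
    return total.item()

# Interleaving: cycle the four domain loaders so each step sees all tasks.
# iters = {d: itertools.cycle(dl) for d, dl in loaders.items()}
# for _ in range(num_steps):
#     interleaved_step(model, opt,
#                      {d: next(it) for d, it in iters.items()},
#                      nn.CrossEntropyLoss())
```

Summing per-domain losses before a single optimizer step is only one plausible reading of "gradient-coupled"; the abstract does not say how IMIC variants A and B differ, so that distinction is deliberately left out.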
Related papers
- OmniFD: A Unified Model for Versatile Face Forgery Detection [45.17431538516313]
We introduce OmniFD, a unified framework that jointly addresses four core forgery detection tasks within a single model. Our architecture consists of three principal components: (1) a shared Swin Transformer that extracts unified 4D-temporal representations from both images and video inputs, (2) a cross-task interaction module with learnable queries, and (3) lightweight decoding heads that transform refined representations into corresponding predictions.
arXiv Detail & Related papers (2025-11-30T22:36:42Z)
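The OmniFD summary describes a familiar unification pattern: one shared backbone feeding task-specific heads through a query-based interaction module. A generic PyTorch sketch of that pattern follows; the module names, shapes, and attention layout are assumptions for illustration, not OmniFD's actual architecture.

```python
# Generic sketch of a shared-representation, query-per-task design.
# Not OmniFD's real code: the query/attention/head layout is assumed.
import torch
import torch.nn as nn

class QueryMultiTaskHeads(nn.Module):
    def __init__(self, feat_dim, classes_per_task, num_heads=8):
        super().__init__()
        # One learnable query per task; cross-attention lets each task
        # pull what it needs from the shared backbone tokens.
        self.queries = nn.ParameterDict({
            t: nn.Parameter(torch.randn(1, 1, feat_dim))
            for t in classes_per_task
        })
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.heads = nn.ModuleDict({
            t: nn.Linear(feat_dim, c) for t, c in classes_per_task.items()
        })

    def forward(self, tokens):  # tokens: (B, N, feat_dim) from a shared encoder
        out = {}
        for t, q in self.queries.items():
            q = q.expand(tokens.size(0), -1, -1)
            refined, _ = self.attn(q, tokens, tokens)   # task-specific readout
            out[t] = self.heads[t](refined.squeeze(1))  # lightweight decode
        return out
```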
- One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection [37.44241182701723]
Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models. Dinomaly2 is the first unified framework for full-spectrum image UAD. Our model achieves unprecedented 99.9% and 99.3% image-level (I-)AUROC on MVTec-AD and VisA respectively.
arXiv Detail & Related papers (2025-10-20T14:57:52Z)
- PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models [80.65273820998875]
We present PanMatch, a versatile foundation model for robust correspondence matching. Our key insight is that any two-frame correspondence matching task can be addressed within a 2D displacement estimation framework. PanMatch achieves multi-task integration by endowing displacement estimation algorithms with unprecedented generalization capabilities.
arXiv Detail & Related papers (2025-07-11T08:18:52Z)
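PanMatch's stated insight, that any two-frame matching task fits a 2D displacement framework, can be made concrete in a few lines. The sketch below treats optical flow and rectified stereo as special cases of one displacement estimator; `estimate_displacement` is a hypothetical stand-in for a dense matcher, not PanMatch's API.

```python
# Toy illustration: flow and stereo as views of one (dx, dy) field.
import numpy as np

def estimate_displacement(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Return an (H, W, 2) field of per-pixel (dx, dy) displacements
    from frame_a to frame_b. Any dense matcher could implement this."""
    raise NotImplementedError  # placeholder backbone

def optical_flow(a, b):
    return estimate_displacement(a, b)      # the full 2D field is the flow

def stereo_disparity(left, right):
    field = estimate_displacement(left, right)
    return -field[..., 0]                   # rectified stereo: dy = 0, disparity = -dx
```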
- One RL to See Them All: Visual Triple Unified Reinforcement Learning [92.90120580989839]
We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables visual reasoning and perception tasks within a single training pipeline. V-Triune comprises triple complementary components: Sample-Level Datashelf (to unify diverse task inputs), Verifier-Level Reward (to deliver custom rewards via specialized verifiers). We introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune.
arXiv Detail & Related papers (2025-05-23T17:41:14Z)
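The summary names a "Dynamic IoU reward" without defining it, so the following is purely a hypothetical illustration of what adaptive, progressive IoU feedback could look like: the IoU threshold required for credit tightens as training advances.

```python
# Hypothetical dynamic IoU reward: NOT V-Triune's actual formulation.
# Early in training, loose localization earns reward; later, only tight
# boxes do, giving progressively harder feedback.
def dynamic_iou_reward(iou: float, progress: float,
                       start_thresh: float = 0.5,
                       end_thresh: float = 0.95) -> float:
    """progress is the fraction of training completed, in [0, 1]."""
    thresh = start_thresh + (end_thresh - start_thresh) * progress
    return iou if iou >= thresh else 0.0
```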
- CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology [17.781388341968967]
CPath-Omni is the first LMM designed to unify both patch- and WSI-level image analysis. CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets. CPath-CLIP, for the first time, integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model.
arXiv Detail & Related papers (2024-12-16T18:46:58Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
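TaskCLIP's two-stage design is easy to illustrate generically: detect objects first, then score each detection against the task description with a vision-language model. In the sketch below, `detect_objects` and `vl_similarity` are hypothetical stand-ins, not TaskCLIP's actual components.

```python
# Generic two-stage task-oriented detection: detect, then select by
# vision-language similarity to the task text. All names are assumed.
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]

def task_oriented_detect(
    image,
    task_text: str,
    detect_objects: Callable[..., List[Tuple[Box, object]]],  # -> [(box, crop)]
    vl_similarity: Callable[[object, str], float],            # crop x text -> score
    threshold: float = 0.25,
) -> List[Box]:
    """Stage 1: generic detection. Stage 2: keep boxes whose crops match
    the task description (e.g. "something to pour water with")."""
    return [box for box, crop in detect_objects(image)
            if vl_similarity(crop, task_text) >= threshold]
```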
- PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models [20.379104447051155]
We present PointSeg, a training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks. PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts to align their corresponding pixels across frames. Our approach significantly surpasses the state-of-the-art specialist training-free model by 14.1%, 12.3%, and 12.6% mAP on the ScanNet, ScanNet++, and KITTI-360 datasets.
arXiv Detail & Related papers (2024-03-11T03:28:20Z)
- Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [63.88319217738223]
We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
arXiv Detail & Related papers (2023-05-22T17:59:43Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming conventional, independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning method with Cyclical Self-Regulation (DML-CSR) for face parsing.
Specifically, DML-CSR designs a multi-task model that comprises face parsing, binary edge detection, and category edge detection.
Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
arXiv Detail & Related papers (2022-03-28T02:12:30Z)
- Distribution Alignment: A Unified Framework for Long-tail Visual Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weight method in the two-stage learning to balance the class prior (a generic re-weighting is sketched after this entry).
Our approach achieves state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z)
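The Distribution Alignment entry above mentions a generalized re-weighting method that balances the class prior. A common generic form, not necessarily that paper's exact formulation, weights each class by its inverse empirical frequency with a tunable exponent:

```python
# Generic class-prior re-weighting sketch (assumption: inverse-frequency
# weights with exponent gamma; gamma=0 gives uniform weights, gamma=1
# fully inverts the prior). Not the paper's exact "generalized re-weight".
import torch

def class_prior_weights(class_counts: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    prior = class_counts.float() / class_counts.sum()  # empirical class prior
    weights = prior.pow(-gamma)                        # up-weight rare classes
    return weights / weights.mean()                    # normalize around 1

# Usage with a weighted cross-entropy loss on long-tailed counts:
# counts = torch.tensor([5000, 300, 12])
# criterion = torch.nn.CrossEntropyLoss(weight=class_prior_weights(counts))
```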