NearID: Identity Representation Learning via Near-identity Distractors
Abstract Overview
This paper identifies a systematic failure mode in vision encoders used for identity-focused tasks: embeddings entangle object identity with background context, allowing visually similar but distinct objects placed on the same background to outscore true cross-view matches. To address this, the authors introduce NearID, a framework built around matched-context "near-identity distractors" that isolate identity as the sole discriminative signal. The framework comprises a curated dataset of 19,386 identities with 316,505 distractors generated from four generative models, a two-tier contrastive loss enforcing the ordering same identity > near-identity distractor > random negative, and a strict margin-based evaluation protocol (SSR and PA). Rather than fine-tuning the full encoder, the method keeps a SigLIP2 backbone frozen and trains only a lightweight MAP projection head (~3.6% of total parameters).
Novelty
The key novelty is the explicit construction and use of matched-context near-identity distractors (semantically similar but distinct objects inpainted into the exact same background as the reference) to remove contextual shortcuts and test whether representations truly encode object identity. The paper pairs this data construction with a two-component contrastive objective (discrimination + ranking regularizer) that enforces a three-tier similarity hierarchy, along with a margin-based evaluation protocol specifically designed to quantify identity-context entanglement.
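To make the three-tier ordering (same identity > near-identity distractor > random negative) concrete, here is a minimal sketch of a hinge-style objective with a discrimination term and a ranking term. It is not the paper's loss, whose exact form this summary does not give. The function names, the two margins, and the weight `rank_weight` are all hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def three_tier_loss(anchor, positive, distractor, random_neg,
                    margin_disc=0.2, margin_rank=0.1, rank_weight=0.5):
    """Hinge-style stand-in for a discrimination + ranking objective.

    The margins and the weight are hypothetical values, not taken from the paper.
    """
    s_pos = cosine(anchor, positive)    # same identity, different view
    s_dis = cosine(anchor, distractor)  # near-identity distractor, same background
    s_neg = cosine(anchor, random_neg)  # unrelated image

    # Discrimination term: the true match should beat the distractor by margin_disc.
    disc = max(0.0, margin_disc - (s_pos - s_dis))
    # Ranking term: the distractor should still score above a random negative.
    rank = max(0.0, margin_rank - (s_dis - s_neg))
    return disc + rank_weight * rank


# Toy 2-D embeddings (made up): the ordering holds, so the loss is zero.
ordered = three_tier_loss([1, 0], [1, 0], [1, 1], [0, 1])
# Swapping positive and distractor violates the ordering, so the loss is positive.
violated = three_tier_loss([1, 0], [1, 1], [1, 0], [0, 1])
```

The ranking term is what separates this from treating all negatives equally: it keeps distractors closer to the anchor than random negatives instead of pushing every negative away by the same amount.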
Results
On the NearID benchmark, the frozen SigLIP2 baseline achieves only 30.74% SSR, whereas the proposed method reaches 99.17% SSR and 99.71% PA. On the MTG part-level benchmark, SSR improves from 0.0% (for all standard encoders) to 35.0%, and metric-to-oracle Pearson correlation increases from 0.180 to 0.465. On DreamBench++, metric-to-human correlation improves from 0.516 to 0.545, with gains generalizing to unseen categories including animals (+0.105) and humans (+0.065).
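The metric-to-oracle and metric-to-human figures above are Pearson correlations between a similarity metric's scores and reference ratings. A minimal sketch of that computation follows. The score lists are made-up illustrative values, not data from the paper, and the helper name `pearson` is our own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: embedding similarities vs. human identity ratings.
metric_scores = [0.91, 0.40, 0.75, 0.62, 0.18]
human_ratings = [5, 2, 4, 4, 1]
r = pearson(metric_scores, human_ratings)
```

A value near 1 means the metric ranks image pairs much as the reference ratings do. Values near 0, such as the 0.141 reported for the collapsed Circle+Ranking variant, indicate little linear agreement.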
Key Points
- NearID defines a matched-context evaluation setting in which near-identity distractors share the exact background of the reference image, exposing systematic background-driven failures in standard vision encoders (e.g., SigLIP2 achieves only 30.74% SSR).
- The training method uses a frozen SigLIP2 backbone with a lightweight MAP projection head (~3.6% of parameters) and a two-component loss (discrimination + ranking regularizer) that enforces a structured three-tier similarity hierarchy rather than treating all negatives equally.
- Ablation experiments show that standard contrastive losses are insufficient (InfoNCE reaches only 60.97% SSR) and that aggressive hierarchical losses risk representation collapse (Circle+Ranking drops the metric-to-human correlation, M–H, to 0.141), while the NearID loss balances near-perfect discrimination (99.17% SSR) with maintained human alignment (M–H = 0.545).
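The matched-context evaluation in the first point can be illustrated with a small sketch. This summary does not define SSR or PA, so the two rates below are assumed readings rather than the paper's definitions: a strict per-sample rate, where the true match must beat every distractor by a margin, and a looser pairwise rate. The function names and the `margin` parameter are hypothetical.

```python
def strict_success_rate(samples, margin=0.0):
    """Fraction of samples whose positive beats ALL distractors by more than margin.

    Each sample is (sim_positive, [sim_distractor, ...]), holding similarities
    to the same anchor. This is an assumed, illustrative criterion.
    """
    passed = sum(
        all(sim_pos - s > margin for s in sim_distractors)
        for sim_pos, sim_distractors in samples
    )
    return passed / len(samples)


def pairwise_rate(samples):
    """Fraction of (positive, distractor) pairs in which the positive scores higher."""
    pairs = [(sim_pos, s) for sim_pos, sims in samples for s in sims]
    return sum(sim_pos > s for sim_pos, s in pairs) / len(pairs)


# Made-up similarities: the second sample has one distractor outscoring the match.
samples = [
    (0.9, [0.5, 0.6]),
    (0.7, [0.8, 0.4]),
]
strict = strict_success_rate(samples)
pairwise = pairwise_rate(samples)
```

Under these assumptions a single background-matched distractor that outscores the true match fails the whole sample in the strict rate while costing only one pair in the pairwise rate, so a strict per-sample score can sit far below a pairwise one.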
References
- arXiv: https://arxiv.org/abs/2604.01973v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.01973v1
- Hugging Face Papers: https://huggingface.co/papers/2604.01973
- Project: https://gorluxor.github.io/NearID/