Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence
- URL: http://arxiv.org/abs/2506.08220v1
- Date: Mon, 09 Jun 2025 20:40:47 GMT
- Title: Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence
- Authors: Octave Mariotti, Zhipeng Du, Yash Bhalgat, Oisin Mac Aodha, Hakan Bilen
- Abstract summary: We propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations.
- Score: 37.26437707181298
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Semantic correspondence (SC) aims to establish semantically meaningful matches across different instances of an object category. We illustrate how recent supervised SC methods remain limited in their ability to generalize beyond sparsely annotated training keypoints, effectively acting as keypoint detectors. To address this, we propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations. Additionally, we introduce SPair-U, an extension of SPair-71k with novel keypoint annotations, to better assess generalization. Experiments demonstrate not only that our model significantly outperforms supervised baselines on unseen keypoints, highlighting its effectiveness in learning robust correspondences, but also that unsupervised baselines outperform supervised counterparts when generalizing across different datasets.
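To make the lifting step concrete, below is a minimal sketch of unprojecting annotated 2D keypoints into camera-space 3D points with an off-the-shelf monocular depth map. The function name, the pinhole intrinsics (fx, fy, cx, cy), and the nearest-pixel depth sampling are illustrative assumptions; the paper's contribution is the canonical manifold learned on top of such lifted points, which this sketch does not attempt to reproduce.

```python
import numpy as np

def lift_keypoints_to_3d(keypoints_2d, depth_map, fx, fy, cx, cy):
    """Unproject (u, v) pixel keypoints into camera-space (X, Y, Z)
    using a monocular depth map and pinhole intrinsics."""
    u, v = keypoints_2d[:, 0], keypoints_2d[:, 1]
    # Nearest-pixel depth lookup; a real pipeline might interpolate.
    z = depth_map[v.round().astype(int), u.round().astype(int)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Toy usage: two keypoints on a 480x640 depth map with assumed intrinsics.
depth = np.full((480, 640), 2.5)                  # stand-in for estimated depth
kps = np.array([[320.0, 240.0], [400.0, 260.0]])  # (u, v) pixel coordinates
print(lift_keypoints_to_3d(kps, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

The unprojection itself is standard geometry; the harder part the abstract points to is aligning these per-image points into a shared canonical space without camera annotations.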
Related papers
- CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding [55.33317649771575]
Embodied Reference Understanding involves predicting the object that a person in the scene is referring to through both pointing gesture and language.
We propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction, as sketched after this entry.
We present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features.
arXiv Detail & Related papers (2025-07-29T15:00:21Z)
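As a hedged illustration of the two pointing directions above: each branch turns a pair of body keypoints into a ray hypothesis. The keypoint values and the helper below are invented for the example; the actual CAPE models learn heatmap cues and ensemble them with CLIP features rather than computing rays this directly.

```python
import numpy as np

def pointing_ray(origin, fingertip):
    """Unit-direction ray from an origin keypoint toward the fingertip."""
    direction = fingertip - origin
    return origin, direction / np.linalg.norm(direction)

# Toy 3D keypoints (camera frame, metres) standing in for pose estimates.
head      = np.array([0.0, 1.6, 2.0])
wrist     = np.array([0.3, 1.2, 1.8])
fingertip = np.array([0.5, 1.3, 1.5])

ray_head  = pointing_ray(head, fingertip)   # head-to-fingertip hypothesis
ray_wrist = pointing_ray(wrist, fingertip)  # wrist-to-fingertip hypothesis
print(ray_head[1], ray_wrist[1])            # the two direction hypotheses
```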
- Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels [69.58063088519852]
We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling.
Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining.
While reducing the need for dataset-specific annotations, we set a new state-of-the-art on SPair-71k by over 4% absolute gain.
arXiv Detail & Related papers (2025-06-05T17:54:33Z)
- Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning [65.75756724642932]
In incomplete multi-view clustering, missing data induce prototype shifts within views and semantic inconsistencies across views.
We propose an imputation- and alignment-free IMVC framework for consensus semantics learning (FreeCSL).
FreeCSL achieves more confident and robust assignments on the IMVC task, compared to state-of-the-art competitors.
arXiv Detail & Related papers (2025-05-16T12:37:10Z)
- Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps [39.00415825387414]
We propose a new approach for semantic correspondence estimation that supplements discriminative features with 3D understanding via a weak geometric spherical prior.
Compared to more involved 3D pipelines, our model only requires weak viewpoint information, and the simplicity of our spherical representation enables us to inject informative geometric priors into the model during training.
We present results on the challenging SPair-71k dataset, where our approach demonstrates that it is capable of distinguishing between symmetric views and repeated parts across many object categories.
arXiv Detail & Related papers (2023-12-20T17:35:24Z)
- Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning [82.29761875805369]
One of the ultimate goals of representation learning is to achieve compactness within a class and well-separability between classes.
We propose a novel perspective to use pre-defined class anchors serving as feature centroids to unidirectionally guide feature learning.
The proposed Semantic Anchor Regularization (SAR) can be used in a plug-and-play manner in existing models, as sketched after this entry.
arXiv Detail & Related papers (2023-12-19T05:52:38Z)
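A minimal sketch of the anchor idea from the entry above, under the assumption that the regularizer is a squared distance to a frozen per-class anchor (the paper's exact formulation may differ): because the anchors never move, features are guided unidirectionally toward them.

```python
import numpy as np

def anchor_regularization_loss(features, labels, anchors):
    """Pull each feature toward its fixed, pre-defined class anchor.

    features: (B, D) batch of embeddings.
    labels:   (B,) integer class labels.
    anchors:  (C, D) frozen per-class anchor vectors (never updated),
              so the guidance is unidirectional by construction.
    Returns the mean squared distance to the assigned anchors.
    """
    target = anchors[labels]            # (B, D): one anchor per sample
    return np.mean(np.sum((features - target) ** 2, axis=-1))

# Example: 4 samples, 3 classes, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
labs = np.array([0, 2, 1, 0])
anch = rng.normal(size=(3, 8))          # pre-defined, kept frozen
print(anchor_regularization_loss(feats, labs, anch))
```

Since the anchors are fixed ahead of training, this term can be added to an existing classification loss without architectural changes, which matches the plug-and-play claim.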
- Unsupervised 3D Keypoint Discovery with Multi-View Geometry [104.76006413355485]
We propose an algorithm that learns to discover 3D keypoints on human bodies from multiple-view images without supervision or labels.
Our approach discovers more interpretable and accurate 3D keypoints compared to other state-of-the-art unsupervised approaches.
arXiv Detail & Related papers (2022-11-23T10:25:12Z)
- Unsupervised Learning of 3D Semantic Keypoints with Mutual Reconstruction [11.164069907549756]
3D semantic keypoints are category-level semantically consistent points on 3D objects.
We present an unsupervised method to generate consistent semantic keypoints from point clouds explicitly.
To the best of our knowledge, the proposed method is the first to mine semantically consistent 3D keypoints from a mutual-reconstruction perspective.
arXiv Detail & Related papers (2022-03-19T01:49:21Z)
- Unsupervised Learning on 3D Point Clouds by Clustering and Contrasting [11.64827192421785]
Unsupervised representation learning is a promising direction to auto-extract features without human intervention.
This paper proposes a general unsupervised approach, named ConClu, to learn point-wise and global features.
arXiv Detail & Related papers (2022-02-05T12:54:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.