Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
- URL: http://arxiv.org/abs/2602.19112v2
- Date: Tue, 24 Feb 2026 02:03:41 GMT
- Title: Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
- Authors: Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick,
- Abstract summary: UniMatch is a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes.<n>Our method is versatile for universal object categories and requires no predefined part proposals.
- Score: 8.772996147679729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., only operate for human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guiding, and rank-based contrastive learning, our method is versatile for universal object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate UniMatch consistently outperforms competing methods in various challenging scenarios.
Related papers
- Hierarchical Neural Semantic Representation for 3D Semantic Correspondence [72.8101601086805]
We design the hierarchical neural semantic representation (HNSR), which consists of a global semantic feature to capture high-level structure and multi-resolution local geometric features.<n>Second, we design a progressive global-to-local matching strategy, which establishes coarse semantic correspondence using the global semantic feature.<n>Third, our framework is training-free and broadly compatible with various pre-trained 3D generative backbones, demonstrating strong generalization across diverse shape categories.
arXiv Detail & Related papers (2025-09-22T07:23:07Z) - Bridge the Gap Between Visual and Linguistic Comprehension for Generalized Zero-shot Semantic Segmentation [39.17707407384492]
Generalized zero-shot semantic segmentation (GZS3) aims to achieve the human-level capability of segmenting seen and unseen classes.<n>We propose a Decoupled Vision-Language Matching (DeVLMatch) framework, composed of spatial-part (SPMatch) and channel-state (CSMatch) matching modules.
arXiv Detail & Related papers (2025-03-31T07:39:14Z) - GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z) - Zero-Shot 3D Shape Correspondence [67.18775201037732]
We propose a novel zero-shot approach to computing correspondences between 3D shapes.
We exploit the exceptional reasoning capabilities of recent foundation models in language and vision.
Our approach produces highly plausible results in a zero-shot manner, especially between strongly non-isometric shapes.
arXiv Detail & Related papers (2023-06-05T21:14:23Z) - Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces which encode semantic similarities as embedding space.
These spaces should be transferable to classes beyond those seen during training.
This causes learned embedding spaces to encode incomplete semantic context and misrepresent the semantic relation between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z) - Contrastive Video-Language Segmentation [41.1635597261304]
We focus on the problem of segmenting a certain object referred by a natural language sentence in video content.
We propose to interwind the visual and linguistic modalities in an explicit way via the contrastive learning objective.
arXiv Detail & Related papers (2021-09-29T01:40:58Z) - Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.