CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization
- URL: http://arxiv.org/abs/2410.03054v1
- Date: Fri, 4 Oct 2024 00:23:20 GMT
- Title: CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization
- Authors: Shigemichi Matsuzaki, Kazuhito Tanaka, Kazuhiro Shintani,
- Abstract summary: One of the most promising approaches for localization on object maps is to use semantic graph matching.
To address the vulnerability of landmark descriptors to misclassification and partial observations, we augment the correspondence matching using Vision Language Models.
In addition, inliers are estimated deterministically using a graph-theoretic approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This letter proposes a method of global localization on a map with semantic object landmarks. One of the most promising approaches for localization on object maps is to use semantic graph matching using landmark descriptors calculated from the distribution of surrounding objects. These descriptors are vulnerable to misclassification and partial observations. Moreover, many existing methods rely on inlier extraction using RANSAC, which is stochastic and sensitive to a high outlier rate. To address the former issue, we augment the correspondence matching using Vision Language Models (VLMs). Landmark discriminability is improved by VLM embeddings, which are independent of surrounding objects. In addition, inliers are estimated deterministically using a graph-theoretic approach. We also incorporate pose calculation using the weighted least squares considering correspondence similarity and observation completeness to improve the robustness. We confirmed improvements in matching and pose estimation accuracy through experiments on ScanNet and TUM datasets.
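The two deterministic steps named in the abstract, graph-theoretic inlier estimation and weighted least-squares pose calculation, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes simplified 3D point observations matched to 3D map landmarks, a pairwise distance-preservation test for compatibility, and an exhaustive Bron-Kerbosch maximum-clique search. The function names, the `tol` threshold, and the weight vector `w` (standing in for correspondence similarity and observation completeness) are all hypothetical.

```python
import itertools
import numpy as np

def compatibility_graph(obs, lm, pairs, tol=0.1):
    """Adjacency matrix over candidate correspondences (obs index, landmark index).
    Two candidates are compatible when the distance between their observed
    points matches the distance between their landmarks (assumed criterion)."""
    n = len(pairs)
    adj = np.zeros((n, n), dtype=bool)
    for a, b in itertools.combinations(range(n), 2):
        (i, k), (j, m) = pairs[a], pairs[b]
        if i == j or k == m:  # enforce one-to-one matching
            continue
        d_obs = np.linalg.norm(obs[i] - obs[j])
        d_map = np.linalg.norm(lm[k] - lm[m])
        adj[a, b] = adj[b, a] = abs(d_obs - d_map) < tol
    return adj

def max_clique(adj):
    """Bron-Kerbosch with pivoting; returns the largest clique as a sorted list.
    No random sampling is involved, unlike RANSAC."""
    n = len(adj)
    nbrs = [set(np.flatnonzero(adj[v]).tolist()) for v in range(n)]
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        pivot = max(sorted(p | x), key=lambda v: len(nbrs[v] & p))
        for v in sorted(p - nbrs[pivot]):
            expand(r | {v}, p & nbrs[v], x & nbrs[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(n)), set())
    return best

def weighted_rigid_fit(src, dst, w):
    """Weighted least squares: R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2
    (weighted Kabsch solution via SVD)."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_s, mu_d = w @ src, w @ dst
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In this sketch, the largest clique of mutually compatible candidates is taken as the inlier set, and only those correspondences enter the weighted fit. Exhaustive clique search has exponential worst-case cost, so it is practical only for small candidate sets such as object-level landmarks.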
Related papers
- GOReloc: Graph-based Object-Level Relocalization for Visual SLAM [17.608119427712236]
This article introduces a novel method for object-level relocalization of robotic systems.
It determines the pose of a camera sensor by robustly associating the object detections in the current frame with 3D objects in a lightweight object-level map.
arXiv Detail & Related papers (2024-08-15T03:54:33Z)
- CLIP-Loc: Multi-modal Landmark Association for Global Localization in Object-based Maps [0.16492989697868893]
This paper describes a multi-modal data association method for global localization using object-based maps and camera images.
We propose labeling landmarks with natural language descriptions and extracting correspondences based on conceptual similarity with image observations.
arXiv Detail & Related papers (2024-02-08T22:59:12Z)
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Loop Closure Detection Based on Object-level Spatial Layout and Semantic Consistency [14.694754836704819]
We present an object-based loop closure detection method based on the spatial layout and semantic consistency of the 3D scene graph.
Experimental results demonstrate that our proposed data association approach can construct more accurate 3D semantic maps.
arXiv Detail & Related papers (2023-04-11T11:20:51Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even with a drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z)
- DenseGAP: Graph-Structured Dense Correspondence Learning with Anchor Points [15.953570826460869]
Establishing dense correspondence between two images is a fundamental computer vision problem.
We introduce DenseGAP, a new solution for efficient Dense correspondence learning with a Graph-structured neural network conditioned on Anchor Points.
Our method advances the state-of-the-art of correspondence learning on most benchmarks.
arXiv Detail & Related papers (2021-12-13T18:59:30Z)
- Object-Augmented RGB-D SLAM for Wide-Disparity Relocalisation [3.888848425698769]
We propose a novel object-augmented RGB-D SLAM system that is capable of constructing a consistent object map and performing relocalisation based on centroids of objects in the map.
arXiv Detail & Related papers (2021-08-05T11:02:25Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals of distributional models (word embeddings) and word frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature.
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of a pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.