AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with
Pretrained ViT
- URL: http://arxiv.org/abs/2309.08134v1
- Date: Fri, 15 Sep 2023 04:05:01 GMT
- Title: AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with
Pretrained ViT
- Authors: Fangbo Qin, Taogang Hou, Shan Lin, Kaiyuan Wang, Michael C. Yip, Shan
Yu
- Abstract summary: We propose a one-shot instance-aware object keypoint (OKP) extraction approach, AnyOKP, for flexible object-centric visual perception.
An off-the-shelf pretrained vision transformer (ViT) is deployed for generalizable and transferable feature extraction.
AnyOKP is evaluated on real object images collected with the cameras of a robot arm, a mobile robot, and a surgical robot.
- Score: 28.050252998288478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Towards flexible object-centric visual perception, we propose a one-shot
instance-aware object keypoint (OKP) extraction approach, AnyOKP, which
leverages the powerful representation ability of a pretrained vision transformer
(ViT), and can obtain keypoints on multiple object instances of arbitrary
category after learning from a support image. An off-the-shelf pretrained ViT is
directly deployed for generalizable and transferable feature extraction, which
is followed by training-free feature enhancement. The best-prototype pairs
(BPPs) are searched for in support and query images based on appearance
similarity, to yield instance-unaware candidate keypoints. Then, the entire graph with all candidate keypoints as vertices is divided into sub-graphs according to the feature distributions on the graph edges. Finally, each
sub-graph represents an object instance. AnyOKP is evaluated on real object
images collected with the cameras of a robot arm, a mobile robot, and a
surgical robot, demonstrating not only cross-category flexibility and instance awareness but also remarkable robustness to domain shift and viewpoint change.
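To make the described pipeline concrete, below is a minimal sketch of the two matching steps named in the abstract: best-prototype-pair search by cosine similarity between support and query ViT patch features, followed by grouping of candidate keypoints into object instances. This is not the authors' implementation; the feature extractor, thresholds, and the spatial-proximity grouping criterion are illustrative assumptions (the paper instead divides the candidate graph by feature distributions on its edges).

```python
# Minimal sketch of an AnyOKP-style matching/grouping step (illustrative only,
# not the authors' implementation). Assumes patch features were already
# extracted with any off-the-shelf pretrained ViT (e.g. DINO).
import numpy as np
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def best_prototype_pairs(support_protos, query_patches, sim_thresh=0.6):
    """Mutual-nearest-neighbour matching between support prototypes and query patches.

    support_protos: (K, D) ViT features at the K annotated support keypoints.
    query_patches:  (N, D) ViT patch features of the query image.
    Returns a list of (prototype_idx, patch_idx) candidate keypoint pairs.
    """
    s = F.normalize(support_protos, dim=-1)
    q = F.normalize(query_patches, dim=-1)
    sim = s @ q.T                      # (K, N) cosine similarities
    best_patch = sim.argmax(dim=1)     # best query patch per prototype
    best_proto = sim.argmax(dim=0)     # best prototype per query patch
    pairs = []
    for k in range(s.shape[0]):
        n = best_patch[k].item()
        # keep only mutual best matches above the similarity threshold
        if best_proto[n].item() == k and sim[k, n] >= sim_thresh:
            pairs.append((k, n))
    return pairs


def group_candidates(candidate_xy, dist_thresh=64.0):
    """Group candidate keypoints into putative object instances.

    candidate_xy: (M, 2) pixel coordinates of the candidate keypoints.
    A crude spatial-proximity stand-in for the paper's sub-graph division,
    which uses feature distributions on the graph edges instead.
    """
    xy = np.asarray(candidate_xy, dtype=np.float32)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    adj = csr_matrix(d <= dist_thresh)
    n_instances, labels = connected_components(adj, directed=False)
    return n_instances, labels
```

In practice, support_protos would hold the patch features at the annotated support keypoints and query_patches the patch grid of a query image; the mutual-best-match test is a common heuristic stand-in for the paper's BPP search.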
Related papers
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way of capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - Learning-based Relational Object Matching Across Views [63.63338392484501]
We propose a learning-based approach which combines local keypoints with novel object-level features for matching object detections between RGB images.
We train our object-level matching features based on appearance and inter-frame and cross-frame spatial relations between objects in an associative graph neural network.
arXiv Detail & Related papers (2023-05-03T19:36:51Z) - USEEK: Unsupervised SE(3)-Equivariant 3D Keypoints for Generalizable
Manipulation [19.423310410631085]
USEEK is an unsupervised SE(3)-equivariant keypoint method that enjoys alignment across instances within a category.
With USEEK in hand, the robot can infer the category-level task-relevant object frames in an efficient and explainable manner.
arXiv Detail & Related papers (2022-09-28T06:42:29Z) - Pose for Everything: Towards Category-Agnostic Pose Estimation [93.07415325374761]
Category-Agnostic Pose Estimation (CAPE) aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition.
A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images.
We also introduce the Multi-category Pose (MP-100) dataset, which is a 2D pose dataset of 100 object categories containing over 20K instances and is well-designed for developing CAPE algorithms.
arXiv Detail & Related papers (2022-07-21T09:40:54Z) - AutoLink: Self-supervised Learning of Human Skeletons and Object
Outlines by Linking Keypoints [16.5436159805682]
We propose a self-supervised method that learns to disentangle object structure from the appearance.
Both the keypoint location and their pairwise edge weights are learned, given only a collection of images depicting the same object class.
The resulting graph is interpretable, for example, AutoLink recovers the human skeleton topology when applied to images showing people.
arXiv Detail & Related papers (2022-05-21T16:32:34Z) - Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios.
arXiv Detail & Related papers (2022-04-12T15:03:51Z) - End-to-end Reinforcement Learning of Robotic Manipulation with Robust
Keypoints Representation [7.374994747693731]
We present an end-to-end Reinforcement Learning framework for robotic manipulation tasks, using a robust and efficient keypoints representation.
The proposed method learns keypoints from camera images as the state representation, through a self-supervised autoencoder architecture.
We demonstrate the effectiveness of the proposed method on robotic manipulation tasks including grasping and pushing, in different scenarios.
arXiv Detail & Related papers (2022-02-12T09:58:09Z) - Semantically Grounded Object Matching for Robust Robotic Scene
Rearrangement [21.736603698556042]
We present a novel approach to object matching that uses a large pre-trained vision-language model to match objects in a cross-instance setting.
We demonstrate that this provides considerably improved matching performance in cross-instance settings (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-11-15T18:39:43Z) - One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z) - Point-Set Anchors for Object Detection, Instance Segmentation and Pose
Estimation [85.96410825961966]
We argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries.
To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions.
We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation.
arXiv Detail & Related papers (2020-07-06T15:59:56Z)
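The "Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement" entry above describes matching objects across instances with a large pre-trained vision-language model. The sketch below illustrates that general idea, not that paper's code: object crops from two scenes are embedded with a CLIP image encoder and matched by a Hungarian assignment over cosine similarities. The model name, the use of pre-cropped object images, and the assignment step are assumptions for illustration.

```python
# Illustrative sketch of cross-instance object matching with a pretrained
# vision-language model (not the code of any paper listed above).
import torch
from scipy.optimize import linear_sum_assignment
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def embed_crops(crops):
    """Embed a list of PIL object crops with the CLIP image encoder."""
    inputs = processor(images=crops, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)


def match_objects(crops_a, crops_b):
    """Return (index_in_a, index_in_b) pairs via Hungarian assignment."""
    fa, fb = embed_crops(crops_a), embed_crops(crops_b)
    sim = (fa @ fb.T).numpy()                 # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)  # maximise total similarity
    return list(zip(rows.tolist(), cols.tolist()))
```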