CoKe: Localized Contrastive Learning for Robust Keypoint Detection
- URL: http://arxiv.org/abs/2009.14115v3
- Date: Mon, 23 Nov 2020 16:22:35 GMT
- Title: CoKe: Localized Contrastive Learning for Robust Keypoint Detection
- Authors: Yutong Bai, Angtian Wang, Adam Kortylewski, Alan Yuille
- Abstract summary: We show that keypoint kernels can be chosen to optimize three types of distances in the feature space.
We formulate this optimization within a framework, called CoKe, that includes supervised contrastive learning.
CoKe achieves state-of-the-art results compared to approaches that jointly represent all keypoints holistically.
- Score: 24.167397429511915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today's most popular approaches to keypoint detection involve very complex
network architectures that aim to learn holistic representations of all
keypoints. In this work, we take a step back and ask: Can we simply learn a
local keypoint representation from the output of a standard backbone
architecture? This will help make the network simpler and more robust,
particularly if large parts of the object are occluded. We demonstrate that
this is possible by looking at the problem from the perspective of
representation learning. Specifically, the keypoint kernels need to be chosen
to optimize three types of distances in the feature space: Features of the same
keypoint should be similar to each other, while differing from those of other
keypoints and from features of the background clutter. We formulate this
optimization process within a framework, called CoKe, that includes supervised
contrastive learning. CoKe makes several approximations to enable
representation learning on large datasets. In
particular, we introduce a clutter bank to approximate non-keypoint features,
and a momentum update to compute the keypoint representation while training the
feature extractor. Our experiments show that CoKe achieves state-of-the-art
results compared to approaches that jointly represent all keypoints
holistically (Stacked Hourglass Networks, MSS-Net) as well as to approaches
that are supervised by detailed 3D object geometry (StarMap). Moreover, CoKe is
robust and performs exceptionally well when objects are partially occluded and
significantly outperforms related work on a range of diverse datasets
(PASCAL3D+, MPII, ObjectNet3D).
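As a rough illustration of the idea described in the abstract, the sketch below implements a localized contrastive objective with a clutter bank of background features and momentum-updated keypoint prototypes. All names and values (KeypointContrastiveHead, bank_size, momentum, tau) are assumptions made for this sketch, not the authors' implementation.

```python
# Hypothetical sketch: localized contrastive learning for keypoints with a
# clutter bank and momentum-updated prototypes (not the authors' code).
import torch
import torch.nn.functional as F


class KeypointContrastiveHead:
    def __init__(self, num_keypoints, feat_dim, bank_size=1024,
                 momentum=0.9, tau=0.07):
        # One prototype vector per keypoint, kept on the unit sphere.
        self.prototypes = F.normalize(torch.randn(num_keypoints, feat_dim), dim=1)
        # Clutter bank: a pool of background (non-keypoint) features.
        self.clutter_bank = F.normalize(torch.randn(bank_size, feat_dim), dim=1)
        self.momentum = momentum
        self.tau = tau

    @torch.no_grad()
    def update_prototypes(self, kp_feats):
        # kp_feats: (num_keypoints, feat_dim) features sampled at annotated keypoints.
        new = F.normalize(kp_feats, dim=1)
        self.prototypes = F.normalize(
            self.momentum * self.prototypes + (1.0 - self.momentum) * new, dim=1)

    def loss(self, kp_feats):
        # InfoNCE-style objective: each keypoint feature is pulled towards its own
        # prototype and pushed away from other keypoints' prototypes and clutter.
        f = F.normalize(kp_feats, dim=1)
        pos = (f * self.prototypes).sum(dim=1) / self.tau    # same keypoint
        neg_kp = f @ self.prototypes.t() / self.tau          # all keypoint prototypes
        neg_bg = f @ self.clutter_bank.t() / self.tau        # background clutter
        logits = torch.cat([neg_kp, neg_bg], dim=1)
        return (torch.logsumexp(logits, dim=1) - pos).mean()
```

In this sketch the three distances from the abstract map onto the positive term (same keypoint), the keypoint negatives (other keypoints), and the clutter-bank negatives (background); the prototypes and the bank only stand in for the approximations described in the paper.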
Related papers
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct representation for capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - Independently Keypoint Learning for Small Object Semantic Correspondence [7.3866687886529805]
A Keypoint Bounding box-centered Cropping method is proposed for small object semantic correspondence.
KBCNet comprises a Cross-Scale Feature Alignment (CSFA) module and an efficient 4D convolutional decoder.
Our method demonstrates a substantial performance improvement of 7.5% on the SPair-71k dataset.
arXiv Detail & Related papers (2024-04-03T12:21:41Z) - Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space.
We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z) - Interacting Hand-Object Pose Estimation via Dense Mutual Attention [97.26400229871888]
3D hand-object pose estimation is the key to the success of many computer vision applications.
We propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object.
Our method is able to produce physically plausible poses with high quality and real-time inference speed.
arXiv Detail & Related papers (2022-11-16T10:01:33Z) - Self-attention on Multi-Shifted Windows for Scene Segmentation [14.47974086177051]
We explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features.
We propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction.
Our models achieve very promising performance on four public scene segmentation datasets.
arXiv Detail & Related papers (2022-07-10T07:36:36Z) - A Unified Transformer Framework for Group-based Segmentation:
Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design different networks for these similar tasks separately, and the resulting networks are difficult to transfer from one task to another.
We introduce a unified framework, termed UFO (UnifiedObject Framework for Co-Object Framework), to tackle these issues.
arXiv Detail & Related papers (2022-03-09T13:35:19Z) - Sim2Real Object-Centric Keypoint Detection and Description [40.58367357980036]
Keypoint detection and description play a central role in computer vision.
We propose an object-centric formulation, which additionally requires identifying which object each interest point belongs to.
We develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.
arXiv Detail & Related papers (2022-02-01T15:00:20Z) - DFC: Deep Feature Consistency for Robust Point Cloud Registration [0.4724825031148411]
We present a novel learning-based alignment network for complex alignment scenes.
We validate our approach on the 3DMatch dataset and the KITTI odometry dataset.
arXiv Detail & Related papers (2021-11-15T08:27:21Z) - End-to-End Learning of Keypoint Representations for Continuous Control
from Images [84.8536730437934]
We show that it is possible to learn efficient keypoint representations end-to-end, without the need for unsupervised pre-training, decoders, or additional losses.
Our proposed architecture consists of a differentiable keypoint extractor that feeds the coordinates directly to a soft actor-critic agent.
arXiv Detail & Related papers (2021-06-15T09:17:06Z) - S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via
Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high level scripting of human understandable behaviours.
arXiv Detail & Related papers (2020-09-30T14:44:54Z) - Towards High Performance Human Keypoint Detection [87.1034745775229]
We find that context information plays an important role in reasoning about human body configuration and invisible keypoints.
Inspired by this, we propose a cascaded context mixer (CCM) which efficiently integrates spatial and channel context information.
To maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy.
We present several sub-pixel refinement techniques for post-processing keypoint predictions to improve detection accuracy (a generic refinement step is sketched after this list).
arXiv Detail & Related papers (2020-02-03T02:24:51Z)
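The last entry above mentions sub-pixel refinement of keypoint predictions; the snippet below is a minimal, generic quarter-offset refinement of a heatmap peak, shown only to illustrate what such post-processing typically looks like. The function name and the 0.25 offset are assumptions of this sketch, not the specific techniques of that paper.

```python
# Generic sub-pixel refinement sketch: nudge the integer heatmap argmax a
# quarter pixel towards the higher neighbouring activation.
import numpy as np


def refine_keypoint(heatmap: np.ndarray) -> tuple:
    """Return (x, y) of the heatmap peak with quarter-pixel refinement."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    fx, fy = float(x), float(y)
    # Compare the two horizontal / vertical neighbours and shift towards the larger one.
    if 0 < x < w - 1:
        fx += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:
        fy += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return fx, fy
```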