Articulation-aware Canonical Surface Mapping
- URL: http://arxiv.org/abs/2004.00614v3
- Date: Tue, 26 May 2020 22:22:44 GMT
- Title: Articulation-aware Canonical Surface Mapping
- Authors: Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani
- Abstract summary: We tackle the tasks of predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and inferring the articulation and pose of the template corresponding to the input image.
Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions.
We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.
- Score: 54.0990446915042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that
indicates the mapping from 2D pixels to corresponding points on a canonical
template shape, and 2) inferring the articulation and pose of the template
corresponding to the input image. While previous approaches rely on keypoint
supervision for learning, we present an approach that can learn without such
annotations. Our key insight is that these tasks are geometrically related, and
we can obtain supervisory signal via enforcing consistency among the
predictions. We present results across a diverse set of animal object
categories, showing that our method can learn articulation and CSM prediction
from image collections using only foreground mask labels for training. We
empirically show that allowing articulation helps learn more accurate CSM
prediction, and that enforcing the consistency with predicted CSM is similarly
critical for learning meaningful articulation.
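To make the consistency idea concrete, here is a minimal sketch of a reprojection-consistency loss: a pixel mapped to a template point should recover that pixel when the posed, articulated template point is projected back into the image. The weak-perspective camera, tensor shapes, and function names below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the geometric consistency idea (not the authors' code).
# Assumes the CSM network has already mapped each foreground pixel to a
# template surface point, and that point has been articulated/posed into
# the camera frame.
import torch

def reprojection_consistency_loss(pixels, template_points_3d, camera, mask):
    """Pixels mapped to template points should reproject back onto themselves.

    pixels:             (N, 2) foreground pixel coordinates
    template_points_3d: (N, 3) posed, articulated template points predicted
                        via the CSM (shapes are assumptions)
    camera:             dict with 'scale' (scalar) and 'trans' (2,) for a
                        weak-perspective projection (an assumption here)
    mask:               (N,) foreground weights in [0, 1]
    """
    # Weak-perspective projection: drop depth, then scale and translate.
    projected = camera["scale"] * template_points_3d[:, :2] + camera["trans"]
    # Penalize pixels whose posed template point reprojects elsewhere.
    per_pixel = ((projected - pixels) ** 2).sum(dim=-1)
    return (mask * per_pixel).sum() / mask.sum().clamp(min=1.0)
```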
Related papers
- SHIC: Shape-Image Correspondences with no Keypoint Supervision [106.99157362200867]
Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template.
Popularised by DensePose for the analysis of humans, the concept has since been applied to more categories.
We introduce SHIC, a method to learn canonical maps without manual supervision, which achieves better results than supervised methods for most categories.
arXiv Detail & Related papers (2024-07-26T17:58:59Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
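As an illustration of regressing high-level contextualized targets, the sketch below follows the generic teacher-student recipe such methods build on; the shapes, mask convention, and EMA update are assumptions, not Skeleton2vec's actual code.

```python
# Generic sketch: a teacher sees the full skeleton sequence and provides
# contextualized feature targets; a student sees a masked sequence and
# regresses those targets on the masked positions.
import torch
import torch.nn.functional as F

def contextualized_target_loss(student, teacher, skeleton_seq, keep_mask):
    """skeleton_seq: (B, T, D) per-frame features; keep_mask: (B, T) with
    1 for visible frames and 0 for masked frames (shapes are assumptions)."""
    with torch.no_grad():
        targets = teacher(skeleton_seq)                      # contextualized targets
    preds = student(skeleton_seq * keep_mask.unsqueeze(-1))  # masked input
    masked = keep_mask == 0
    return F.mse_loss(preds[masked], targets[masked])

def update_teacher(student, teacher, tau=0.999):
    # Exponential moving average of student weights (a common choice).
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.data.mul_(tau).add_(ps.data, alpha=1 - tau)
```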
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that learning semantic correspondences is inherently data-hungry.
We demonstrate that a simple machine annotator can reliably enrich the set of paired keypoints via machine supervision.
Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
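A minimal sketch of attention over prototype vectors that stand in for the distribution of past keys and values might look as follows; the module name and sizes are assumptions, not the paper's architecture.

```python
# Learnable prototype keys/values summarize activations from past training
# samples; queries attend over them like a small external memory.
import torch
import torch.nn as nn

class PrototypeMemory(nn.Module):
    def __init__(self, num_prototypes=128, dim=512):
        super().__init__()
        # Prototype vectors standing in for past keys and values.
        self.proto_keys = nn.Parameter(torch.randn(num_prototypes, dim))
        self.proto_values = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, queries):                    # queries: (B, L, D)
        attn = torch.softmax(
            queries @ self.proto_keys.t() / queries.shape[-1] ** 0.5, dim=-1
        )                                          # (B, L, P)
        return attn @ self.proto_values            # (B, L, D)
```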
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
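A simplified sketch of the relative-location pretext task, with masking of reference features to control difficulty; all names and shapes here are assumptions.

```python
# Predict the query patch's position offset relative to (partially masked)
# reference patch features; trained with cross-entropy against the true offset.
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    def __init__(self, dim=384, num_offsets=49):   # e.g. a 7x7 grid of offsets
        super().__init__()
        self.classify = nn.Linear(2 * dim, num_offsets)

    def forward(self, query_feat, reference_feat, visible_mask):
        """query_feat: (B, D); reference_feat: (B, N, D);
        visible_mask: (B, N) with 0 for masked reference patches."""
        reference_feat = reference_feat * visible_mask.unsqueeze(-1)
        context = reference_feat.mean(dim=1)        # (B, D) pooled references
        logits = self.classify(torch.cat([query_feat, context], dim=-1))
        return logits                               # (B, num_offsets)
```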
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- ATCON: Attention Consistency for Vision Models [0.8312466807725921]
We propose an unsupervised fine-tuning method that improves the consistency of attention maps.
We show results on Grad-CAM and Integrated Gradients in an ablation study.
Those improved attention maps may help clinicians better understand vision model predictions.
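One generic way to phrase an attention-consistency objective is to require the attention map of an image and that of a spatially transformed copy to agree after alignment; the sketch below illustrates this idea only and is not ATCON's exact formulation.

```python
# Consistency between attention maps (e.g. Grad-CAM) of an image and its
# transformed copy, compared after warping into a common frame.
import torch
import torch.nn.functional as F

def attention_consistency_loss(attn_original, attn_transformed, align):
    """attn_*: (B, 1, H, W) attention maps; `align` warps the original map
    into the transformed image's frame (assumed given, e.g. a flip)."""
    aligned = align(attn_original)
    # Normalize maps so the loss compares shape, not overall magnitude.
    def norm(a):
        a = a.flatten(1)
        return a / a.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return F.l1_loss(norm(aligned), norm(attn_transformed))
```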
arXiv Detail & Related papers (2022-10-18T09:30:20Z)
- Unsupervised learning of features and object boundaries from local prediction [0.0]
We introduce a layer of feature maps with a pairwise Markov random field model in which each factor is paired with an additional binary variable, which switches the factor on or off.
We can learn both the features and the parameters of the Markov random field factors from images without further supervision signals.
We show that computing predictions across space aids both segmentation and feature learning, and models trained to optimize these predictions show similarities to the human visual system.
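The gated pairwise factor can be sketched as an energy term: the binary switch either enforces smoothness between neighboring features or pays a constant cost for declaring a boundary. The parameterization below is an illustrative assumption, not the paper's exact model.

```python
# Pairwise MRF factor gated by a binary switch variable s_ij: active
# (s_ij = 1) it penalizes dissimilar neighboring feature vectors; switched
# off (s_ij = 0) it pays a fixed boundary cost instead.
import numpy as np

def pairwise_energy(f_i, f_j, s_ij, beta=1.0, boundary_cost=2.0):
    smooth = beta * np.sum((f_i - f_j) ** 2)   # smoothness term
    return s_ij * smooth + (1 - s_ij) * boundary_cost
```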
arXiv Detail & Related papers (2022-05-27T18:54:10Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
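The pixel-text matching step can be sketched as cosine similarity between dense visual features and class text embeddings; the shapes and temperature below are assumptions in the spirit of the summary, not DenseCLIP's exact code.

```python
# Score every spatial feature against each class text embedding to produce
# per-class score maps that can guide a dense prediction model.
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats, text_embeds, temperature=0.07):
    """pixel_feats: (B, C, H, W) dense visual features
    text_embeds:  (K, C) one embedding per class prompt
    returns:      (B, K, H, W) pixel-text score maps"""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_embeds = F.normalize(text_embeds, dim=1)
    return torch.einsum("bchw,kc->bkhw", pixel_feats, text_embeds) / temperature
```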
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Implicit Mesh Reconstruction from Unannotated Image Collections [48.85604987196472]
We present an approach to infer the 3D shape, texture, and camera pose for an object from a single RGB image.
We represent the shape as an image-conditioned implicit function that transforms the surface of a sphere to that of the predicted mesh, while additionally predicting the corresponding texture.
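A toy sketch of an image-conditioned implicit function that maps unit-sphere points to predicted surface points; the architecture below is a stand-in, not the paper's model.

```python
# An MLP conditioned on a global image feature deforms samples on the unit
# sphere into points on the predicted mesh surface.
import torch
import torch.nn as nn

class SphereToSurface(nn.Module):
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, sphere_pts, image_feat):
        """sphere_pts: (N, 3) unit-sphere samples; image_feat: (feat_dim,)"""
        cond = image_feat.expand(sphere_pts.shape[0], -1)
        return self.mlp(torch.cat([sphere_pts, cond], dim=-1))  # (N, 3)
```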
arXiv Detail & Related papers (2020-07-16T17:55:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.