SHIC: Shape-Image Correspondences with no Keypoint Supervision
- URL: http://arxiv.org/abs/2407.18907v1
- Date: Fri, 26 Jul 2024 17:58:59 GMT
- Title: SHIC: Shape-Image Correspondences with no Keypoint Supervision
- Authors: Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi
- Abstract summary: Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template.
Popularised by DensePose for the analysis of humans, the concept has since been extended to more categories.
We introduce SHIC, a method that learns canonical maps without manual supervision and achieves better results than supervised methods for most categories.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. Popularised by DensePose for the analysis of humans, the concept has since been applied to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision that achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion, which are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.
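As a rough illustration of the reduction described in the abstract, the sketch below matches patch features of a real image (e.g. from DINO) against features extracted from several non-photorealistic template renders, and transfers the 3D template coordinate of the best-matching render pixel to each image pixel. The tensor names, shapes, and simple nearest-neighbour matching are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of reducing image-to-template correspondence to
# image-to-image feature matching. All names/shapes are hypothetical.
import torch
import torch.nn.functional as F

def image_to_template(image_feats, render_feats, render_coords):
    """Assign each image pixel a 3D template point by nearest-neighbour
    feature matching against rendered views of the template.

    image_feats   : (P, D)    patch features of a real object image
    render_feats  : (R, P, D) patch features of R template renders
    render_coords : (R, P, 3) 3D template point visible at each render pixel
    """
    img = F.normalize(image_feats, dim=-1)                 # (P, D)
    ren = F.normalize(render_feats.flatten(0, 1), dim=-1)  # (R*P, D)
    sim = img @ ren.T                                      # cosine similarity, (P, R*P)
    best = sim.argmax(dim=-1)                              # best render pixel per image pixel
    coords = render_coords.flatten(0, 1)[best]             # (P, 3) matched template points
    conf = sim.max(dim=-1).values                          # match confidence per pixel
    return coords, conf
```

In the paper these raw matches would only serve as (noisy) supervision for training a dedicated canonical-map predictor; the sketch stops at the correspondences themselves.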
Related papers
- NPC: Neural Point Characters from Video [21.470471345454524]
High-fidelity human 3D models can now be learned directly from videos.
Previous methods avoid using a template but rely on a costly or ill-posed mapping from observation to canonical space.
We propose a hybrid point-based representation for reconstructing animatable characters.
arXiv Detail & Related papers (2023-04-04T17:59:22Z)
- Generalizable Person Re-Identification via Viewpoint Alignment and Fusion [74.30861504619851]
This work proposes to use a 3D dense pose estimation model and a texture mapping module to map pedestrian images to canonical view images.
Due to the imperfection of the texture mapping module, the canonical view images may lose the discriminative detail clues from the original images.
We show that our method can lead to superior performance over the existing approaches in various evaluation settings.
arXiv Detail & Related papers (2022-12-05T16:24:09Z)
- Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z)
- CAMERAS: Enhanced Resolution And Sanity preserving Class Activation Mapping for image saliency [61.40511574314069]
Backpropagation image saliency aims at explaining model predictions by estimating model-centric importance of individual pixels in the input.
We propose CAMERAS, a technique to compute high-fidelity backpropagation saliency maps without requiring any external priors.
arXiv Detail & Related papers (2021-06-20T08:20:56Z)
- Learning Generative Models of Textured 3D Meshes from Real-World Images [26.353307246909417]
We propose a GAN framework for generating textured triangle meshes without relying on keypoint annotations.
We show that the performance of our approach is on par with prior work that relies on ground-truth keypoints.
arXiv Detail & Related papers (2021-03-29T14:07:37Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Implicit Mesh Reconstruction from Unannotated Image Collections [48.85604987196472]
We present an approach to infer the 3D shape, texture, and camera pose for an object from a single RGB image.
We represent the shape as an image-conditioned implicit function that transforms the surface of a sphere to that of the predicted mesh, while additionally predicting the corresponding texture.
arXiv Detail & Related papers (2020-07-16T17:55:20Z)
- Novel Object Viewpoint Estimation through Reconstruction Alignment [45.16865218423492]
We adopt a reconstruct-and-align approach to estimate the viewpoint of a novel object.
In particular, we propose learning two networks: the first maps images to a 3D geometry-aware feature bottleneck and is trained via an image-to-image translation loss.
At test time, our model finds the relative transformation that best aligns the bottleneck features of our test image to a reference image.
arXiv Detail & Related papers (2020-06-05T17:58:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.