A Visual Navigation Perspective for Category-Level Object Pose Estimation
- URL: http://arxiv.org/abs/2203.13572v1
- Date: Fri, 25 Mar 2022 10:57:37 GMT
- Title: A Visual Navigation Perspective for Category-Level Object Pose Estimation
- Authors: Jiaxin Guo, Fangxun Zhong, Rong Xiong, Yunhui Liu, Yue Wang, Yiyi Liao
- Abstract summary: This paper studies category-level object pose estimation based on a single monocular image.
Recent advances in pose-aware generative models have paved the way for addressing this challenging task using analysis-by-synthesis.
- Score: 41.60364392204057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies category-level object pose estimation based on a single
monocular image. Recent advances in pose-aware generative models have paved the
way for addressing this challenging task using analysis-by-synthesis. The idea
is to sequentially update a set of latent variables, e.g., pose, shape, and
appearance, of the generative model until the generated image best agrees with
the observation. However, convergence and efficiency are two challenges of this
inference procedure. In this paper, we take a deeper look at the inference of
analysis-by-synthesis from the perspective of visual navigation, and
investigate what is a good navigation policy for this specific task. We
evaluate three different strategies, including gradient descent, reinforcement
learning and imitation learning, via thorough comparisons in terms of
convergence, robustness and efficiency. Moreover, we show that a simple hybrid
approach leads to an effective and efficient solution. We further compare these
strategies to state-of-the-art methods, and demonstrate superior performance on
synthetic and real-world datasets leveraging off-the-shelf pose-aware
generative models.
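The inference loop described in the abstract — sequentially updating a pose latent until the generated image best agrees with the observation — can be illustrated with the gradient-descent policy, one of the three strategies the paper compares. The sketch below is a minimal, self-contained toy: `render` is a hypothetical stand-in for a pose-aware generative model (here a 1-D Gaussian bump shifted by a scalar "pose"), and finite differences replace autodiff so no deep-learning framework is needed. All function names and parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def render(pose: float) -> np.ndarray:
    """Toy stand-in for a pose-aware generator: a 1-D Gaussian bump
    whose position is controlled by the scalar pose latent."""
    xs = np.linspace(-1.0, 1.0, 64)
    return np.exp(-((xs - pose) ** 2) / 0.5)

def photometric_loss(pose: float, observation: np.ndarray) -> float:
    """Disagreement between the rendered image and the observation."""
    return float(np.mean((render(pose) - observation) ** 2))

def infer_pose(observation: np.ndarray, init: float = 0.0,
               lr: float = 0.5, steps: int = 200, eps: float = 1e-4) -> float:
    """Analysis-by-synthesis with a gradient-descent navigation policy:
    repeatedly step the pose latent downhill on the photometric loss."""
    pose = init
    for _ in range(steps):
        # Central finite difference; a real system would backpropagate
        # through a differentiable generative model instead.
        grad = (photometric_loss(pose + eps, observation)
                - photometric_loss(pose - eps, observation)) / (2 * eps)
        pose -= lr * grad
    return pose

true_pose = 0.3
estimate = infer_pose(render(true_pose), init=0.0)
```

In this smooth toy landscape the estimate converges to the true pose; the convergence and efficiency issues the paper studies arise because real photometric losses over neural generators are non-convex, which motivates comparing learned policies (reinforcement and imitation learning) and the hybrid approach against plain gradient descent.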
Related papers
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.
We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.
We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes [35.16430027877207]
MOVIS aims to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS.
We introduce an auxiliary task requiring the model to simultaneously predict novel view object masks.
To evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement.
arXiv Detail & Related papers (2024-12-16T05:23:45Z)
- Distillation of Diffusion Features for Semantic Correspondence [23.54555663670558]
We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency.
We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost.
Our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence.
arXiv Detail & Related papers (2024-12-04T17:55:33Z)
- Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis [48.945931374180795]
This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery or vice versa.
We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively.
arXiv Detail & Related papers (2024-12-04T13:47:51Z)
- Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching [19.730504197461144]
We present a novel generalizable object pose estimation method to determine the object pose using only one RGB image.
Our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object.
arXiv Detail & Related papers (2024-11-24T14:31:50Z)
- IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z)
- CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
arXiv Detail & Related papers (2022-11-18T18:18:53Z)
- Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously-unseen objects from monocular images.
We follow a retrieval-based strategy and prevent the network from learning object-specific features.
Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
- Neural Topological SLAM for Visual Navigation [112.73876869904]
We design topological representations for space that leverage semantics and afford approximate geometric reasoning.
We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation.
arXiv Detail & Related papers (2020-05-25T17:56:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.