A Visual Navigation Perspective for Category-Level Object Pose Estimation
- URL: http://arxiv.org/abs/2203.13572v1
- Date: Fri, 25 Mar 2022 10:57:37 GMT
- Title: A Visual Navigation Perspective for Category-Level Object Pose Estimation
- Authors: Jiaxin Guo, Fangxun Zhong, Rong Xiong, Yunhui Liu, Yue Wang, Yiyi Liao
- Abstract summary: This paper studies category-level object pose estimation based on a single monocular image.
Recent advances in pose-aware generative models have paved the way for addressing this challenging task using analysis-by-synthesis.
- Score: 41.60364392204057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies category-level object pose estimation based on a single
monocular image. Recent advances in pose-aware generative models have paved the
way for addressing this challenging task using analysis-by-synthesis. The idea
is to sequentially update a set of latent variables, e.g., pose, shape, and
appearance, of the generative model until the generated image best agrees with
the observation. However, convergence and efficiency are two challenges of this
inference procedure. In this paper, we take a deeper look at the inference of
analysis-by-synthesis from the perspective of visual navigation, and
investigate what is a good navigation policy for this specific task. We
evaluate three different strategies, including gradient descent, reinforcement
learning and imitation learning, via thorough comparisons in terms of
convergence, robustness and efficiency. Moreover, we show that a simple hybrid
approach leads to an effective and efficient solution. We further compare these
strategies to state-of-the-art methods, and demonstrate superior performance on
synthetic and real-world datasets leveraging off-the-shelf pose-aware
generative models.
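The inference procedure described in the abstract (iteratively updating pose, shape, and appearance latents until the generated image agrees with the observation) can be illustrated with a minimal sketch. The snippet below shows only the gradient-descent policy, the simplest of the three strategies compared. Here `generator` is a toy differentiable stand-in for an off-the-shelf pose-aware generative model; all names and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of analysis-by-synthesis inference with a gradient-descent
# navigation policy. `generator` is a toy differentiable stand-in for a
# pose-aware generative model, used purely for illustration.
import torch

def generator(pose, shape, appearance):
    # A real model would render an image from these latent variables.
    return torch.sin(pose + shape) * appearance

def infer_latents(observation, steps=200, lr=0.05):
    # Latent variables to "navigate": pose, shape, and appearance.
    pose = torch.zeros(3, requires_grad=True)
    shape = torch.zeros(3, requires_grad=True)
    appearance = torch.ones(3, requires_grad=True)
    opt = torch.optim.Adam([pose, shape, appearance], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Synthesize, compare with the observation, and step the latents.
        loss = torch.nn.functional.mse_loss(
            generator(pose, shape, appearance), observation)
        loss.backward()
        opt.step()
    return pose.detach(), shape.detach(), appearance.detach()

# Usage: recover latents that explain a synthetic "observation".
true_img = generator(torch.tensor([0.5, -0.2, 0.1]),
                     torch.tensor([0.3, 0.0, -0.4]),
                     torch.tensor([1.2, 0.8, 1.0]))
pose, shape, appearance = infer_latents(true_img)
```

The reinforcement-learning and imitation-learning policies studied in the paper would replace the gradient step with a learned update rule, and the hybrid approach combines the two; those variants are not sketched here.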
Related papers
- Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training [102.82553402539139]
Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image.
These models often face challenges in maintaining consistency across novel and reference views.
We propose to use epipolar geometry to locate and retrieve overlapping information from the input view.
This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning.
arXiv Detail & Related papers (2025-02-25T14:04:22Z)
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.
We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.
We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- Distillation of Diffusion Features for Semantic Correspondence [23.54555663670558]
We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency.
We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost.
Our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence.
arXiv Detail & Related papers (2024-12-04T17:55:33Z)
- Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis [48.945931374180795]
This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery or vice versa.
We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively.
arXiv Detail & Related papers (2024-12-04T13:47:51Z)
- Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching [19.730504197461144]
We present a novel generalizable object pose estimation method to determine the object pose using only one RGB image.
Our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object.
arXiv Detail & Related papers (2024-11-24T14:31:50Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Relational Object-Centric Actor-Critic [44.99833362998488]
Recent works highlight that disentangled object representations can aid policy learning in image-based, object-centric reinforcement learning tasks.
This paper proposes a novel object-centric reinforcement learning algorithm that integrates actor-critic and model-based approaches.
We evaluate our method in a simulated 3D robotic environment and a 2D environment with compositional structure.
arXiv Detail & Related papers (2023-10-26T06:05:12Z)
- IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z)
- CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow [22.161967080759993]
Self-supervised pre-training methods have not yet delivered on dense geometric vision tasks such as stereo matching or optical flow.
We build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene.
We show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques.
arXiv Detail & Related papers (2022-11-18T18:18:53Z)
- Robust Single Image Dehazing Based on Consistent and Contrast-Assisted Reconstruction [95.5735805072852]
We propose a novel density-variational learning framework to improve the robustness of the image dehazing model.
Specifically, the dehazing network is optimized under the consistency-regularized framework.
Our method significantly surpasses the state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T08:11:04Z)
- Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously-unseen objects from monocular images.
We follow a retrieval-based strategy and prevent the network from learning object-specific features.
Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
- CoSformer: Detecting Co-Salient Object with Transformers [2.3148470932285665]
Co-Salient Object Detection (CoSOD) aims at simulating the human visual system to discover the common and salient objects from a group of relevant images.
We propose the Co-Salient Object Detection Transformer (CoSformer) network to capture both salient and common visual patterns from multiple images.
arXiv Detail & Related papers (2021-04-30T02:39:12Z)
- Deep Graph Contrastive Representation Learning [23.37786673825192]
We propose a novel framework for unsupervised graph representation learning by leveraging a contrastive objective at the node level.
Specifically, we generate two graph views by corruption and learn node representations by maximizing the agreement between the two views' representations of each node (see the illustrative sketch after this list).
We perform empirical experiments on both transductive and inductive learning tasks using a variety of real-world datasets.
arXiv Detail & Related papers (2020-06-07T11:50:45Z)
- Neural Topological SLAM for Visual Navigation [112.73876869904]
We design topological representations for space that leverage semantics and afford approximate geometric reasoning.
We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation.
arXiv Detail & Related papers (2020-05-25T17:56:29Z)
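As referenced in the Deep Graph Contrastive Representation Learning entry above, the following is a hypothetical sketch of a node-level two-view contrastive objective. The corruption scheme, encoder, and temperature below are assumptions made for illustration, not that paper's exact configuration.

```python
# Minimal sketch of node-level graph contrastive learning: two corrupted
# views of one graph, with agreement maximized per node. All details here
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def corrupt(x, adj, drop_feat=0.2, drop_edge=0.2):
    # View generation: randomly mask node features and drop edges.
    feat_mask = (torch.rand_like(x) > drop_feat).float()
    edge_mask = (torch.rand_like(adj) > drop_edge).float()
    return x * feat_mask, adj * edge_mask

def encode(x, adj, w):
    # One-layer GCN-style encoder as a stand-in for the real model.
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    return torch.relu((adj @ x) / deg @ w)

def contrastive_loss(z1, z2, tau=0.5):
    # Treat the same node's embedding in the other view as the positive;
    # all other nodes serve as negatives (InfoNCE over the diagonal).
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau
    return F.cross_entropy(sim, torch.arange(z1.size(0)))

# Usage on a tiny random graph: 8 nodes, 16 input features.
x = torch.rand(8, 16)
adj = (torch.rand(8, 8) > 0.5).float()
w = torch.randn(16, 32, requires_grad=True)
z1 = encode(*corrupt(x, adj), w)
z2 = encode(*corrupt(x, adj), w)
loss = contrastive_loss(z1, z2)
loss.backward()
```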
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.