Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
- URL: http://arxiv.org/abs/2312.06594v2
- Date: Mon, 23 Sep 2024 14:23:07 GMT
- Title: Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
- Authors: Aditya Prakash, Arjun Gupta, Saurabh Gupta
- Abstract summary: Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view.
We propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics.
Experiments on three popular 3D-from-a-single-image benchmarks: depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC, show the benefits of KPE.
- Score: 17.074716363691294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks: depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC, show the benefits of KPE.
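As a rough illustration of the idea in the abstract, the sketch below shows one way a crop's image location and the camera intrinsics could be turned into a feature vector. It is not the authors' implementation: the abstract does not specify the form of the encoding, so the choice of viewing angles at the crop corners and center, the sinusoidal features, and the names `crop_ray_angles` and `sinusoidal_encode` are all assumptions made for this example. The intrinsics (`fx`, `fy`, `cx`, `cy`) and box coordinates are made-up values.

```python
import math


def crop_ray_angles(box, fx, fy, cx, cy):
    """Viewing angles (radians) at the four corners and center of a crop.

    box is (x_min, y_min, x_max, y_max) in full-image pixel coordinates.
    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    Hypothetical helper, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    points = [(x0, y0), (x1, y0), (x0, y1), (x1, y1), (xm, ym)]
    # Angle between the optical axis and the ray through each pixel,
    # measured separately along the x and y image axes.
    return [(math.atan2(x - cx, fx), math.atan2(y - cy, fy)) for x, y in points]


def sinusoidal_encode(angles, num_freqs=4):
    """Flatten angle pairs into sin/cos features at doubling frequencies."""
    feats = []
    for ax, ay in angles:
        for a in (ax, ay):
            for k in range(num_freqs):
                feats.append(math.sin((2 ** k) * a))
                feats.append(math.cos((2 ** k) * a))
    return feats


# Assumed pinhole intrinsics for a 640x480 image (illustrative values).
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

# Two crops of identical size: one centered, one at the right image edge.
center_angles = crop_ray_angles((270, 190, 370, 290), fx, fy, cx, cy)
edge_angles = crop_ray_angles((540, 190, 640, 290), fx, fy, cx, cy)

center_feats = sinusoidal_encode(center_angles)
edge_feats = sinusoidal_encode(edge_angles)
```

The two crops have the same pixel size, yet their feature vectors differ because the edge crop is seen along more oblique rays. A crop-based model could receive such a vector alongside the crop pixels so that it has the position information that cropping discards.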
Related papers
- SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views [36.02533658048349]
We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for sparse-view images.
SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views.
It requires only about 20 seconds to produce a textured mesh and camera poses for the input views.
arXiv Detail & Related papers (2024-08-19T17:53:10Z) - CAPE: Camera View Position Embedding for Multi-View 3D Object Detection [100.02565745233247]
Current query-based methods rely on global 3D position embeddings to learn the geometric correspondence between images and 3D space.
We propose a novel method based on CAmera view Position Embedding, called CAPE.
CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on nuScenes dataset.
arXiv Detail & Related papers (2023-03-17T18:59:54Z) - OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for
Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z) - Neural Correspondence Field for Object Pose Estimation [67.96767010122633]
We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image.
Unlike classical correspondence-based methods which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum.
arXiv Detail & Related papers (2022-07-30T01:48:23Z) - Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z) - DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [43.02373021724797]
We introduce a framework for multi-camera 3D object detection.
Our method manipulates predictions directly in 3D space.
We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
arXiv Detail & Related papers (2021-10-13T17:59:35Z) - MonoCInIS: Camera Independent Monocular 3D Object Detection using Instance Segmentation [55.96577490779591]
We show that more data does not automatically guarantee better performance; rather, methods need a degree of 'camera independence' in order to benefit from large and heterogeneous training data.
arXiv Detail & Related papers (2021-10-01T14:56:37Z) - CoCoNets: Continuous Contrastive 3D Scene Representations [21.906643302668716]
This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos.
We show the resulting 3D visual feature representations effectively scale across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection.
arXiv Detail & Related papers (2021-04-08T15:50:47Z) - 3D Object Recognition By Corresponding and Quantizing Neural 3D Scene
Representations [29.61554189447989]
We propose a system that learns to detect objects and infer their 3D poses in RGB-D images.
Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations.
arXiv Detail & Related papers (2020-10-30T13:56:09Z) - Shape and Viewpoint without Keypoints [63.26977130704171]
We present a learning framework that learns to recover the 3D shape, pose and texture from a single image.
We trained on an image collection without any ground truth 3D shape, multi-view, camera viewpoints or keypoint supervision.
We obtain state-of-the-art camera prediction results and show that we can learn to predict diverse shapes and textures across objects.
arXiv Detail & Related papers (2020-07-21T17:58:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.