MonoLayout: Amodal scene layout from a single image
- URL: http://arxiv.org/abs/2002.08394v1
- Date: Wed, 19 Feb 2020 19:16:34 GMT
- Title: MonoLayout: Amodal scene layout from a single image
- Authors: Kaustubh Mani, Swapnil Daga, Shubhika Garg, N. Sai Shankar, Krishna
Murthy Jatavallabhula, K. Madhava Krishna
- Abstract summary: Given a single color image captured from a driving platform, we aim to predict the bird's-eye view layout of the road and other traffic participants.
We dub this problem amodal scene layout estimation, which involves "hallucinating" scene layout for parts of the world that are occluded in the image.
To this end, we present MonoLayout, a deep neural network for real-time amodal scene layout estimation from a single image.
- Score: 12.466845447851377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the novel, highly challenging problem of estimating
the layout of a complex urban driving scenario. Given a single color image
captured from a driving platform, we aim to predict the bird's-eye view layout
of the road and other traffic participants. The estimated layout should reason
beyond what is visible in the image, and compensate for the loss of 3D
information due to projection. We dub this problem amodal scene layout
estimation, which involves "hallucinating" scene layout for even parts of the
world that are occluded in the image. To this end, we present MonoLayout, a
deep neural network for real-time amodal scene layout estimation from a single
image. We represent scene layout as a multi-channel semantic occupancy grid,
and leverage adversarial feature learning to hallucinate plausible completions
for occluded image parts. Due to the lack of fair baseline methods, we extend
several state-of-the-art approaches for road-layout estimation and vehicle
occupancy estimation in bird's-eye view to the amodal setup for rigorous
evaluation. By leveraging temporal sensor fusion to generate training labels,
we significantly outperform current art over a number of datasets. On the KITTI
and Argoverse datasets, we outperform all baselines by a significant margin. We
also make all our annotations and code publicly available. A video abstract of
this paper is available at https://www.youtube.com/watch?v=HcroGyo6yRQ.
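
To make the abstract's representational choices concrete, here is a minimal PyTorch sketch of a MonoLayout-style pipeline: a shared image encoder feeding two decoders that emit bird's-eye-view occupancy channels (static road layout, dynamic vehicles), plus a patch discriminator supplying the adversarial feature-learning signal. This is not the authors' released code; the layer sizes, the 128x128 grid resolution, and the two-channel split are illustrative assumptions consistent with the abstract.

```python
# A minimal sketch, not the authors' released code: a MonoLayout-style shared
# encoder with two decoders that emit bird's-eye-view occupancy channels
# (static road layout, dynamic vehicles), plus a patch discriminator that
# supplies the adversarial feature-learning signal. All layer sizes and the
# 128x128 grid resolution are illustrative assumptions.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps a color image to a compact bottleneck feature map."""

    def __init__(self, in_ch=3, feat_ch=128):
        super().__init__()
        layers, ch = [], in_ch
        for out_ch in (32, 64, 128, feat_ch):
            layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, img):
        return self.net(img)


class BEVDecoder(nn.Module):
    """Decodes the bottleneck into one occupancy channel in bird's-eye view."""

    def __init__(self, feat_ch=128, out_ch=1):
        super().__init__()
        layers, ch = [], feat_ch
        for nxt in (64, 32, 16):
            layers += [nn.ConvTranspose2d(ch, nxt, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            ch = nxt
        layers += [nn.Conv2d(ch, out_ch, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, feat):
        return torch.sigmoid(self.net(feat))  # per-cell occupancy probability


class PatchDiscriminator(nn.Module):
    """Scores whether a BEV layout patch looks like a plausible true layout."""

    def __init__(self, in_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, layout):
        return self.net(layout)


encoder = Encoder()
road_decoder, vehicle_decoder = BEVDecoder(), BEVDecoder()
discriminator = PatchDiscriminator()

image = torch.randn(2, 3, 256, 256)           # batch of monocular color images
feat = encoder(image)                         # (2, 128, 16, 16) bottleneck
road = road_decoder(feat)                     # (2, 1, 128, 128) road occupancy
vehicles = vehicle_decoder(feat)              # (2, 1, 128, 128) vehicle occupancy
layout = torch.cat([road, vehicles], dim=1)   # multi-channel semantic occupancy grid
realism = discriminator(road)                 # adversarial signal on the static layout
```

In training, the discriminator would also be shown layouts drawn from the label distribution, pushing the decoders toward plausible completions of regions that are occluded in the input image, in the spirit of the adversarial feature learning the abstract describes.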
Related papers
- Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation [39.08243715525956]
Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision.
With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion.
We propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction.
arXiv Detail & Related papers (2024-04-11T17:30:24Z)
- PanoContext-Former: Panoramic Total Scene Understanding with a Transformer [37.51637352106841]
Panoramic images enable a deeper understanding and a more holistic perception of the $360^{\circ}$ surrounding environment.
In this paper, we propose a novel method using depth prior for holistic indoor scene understanding.
In addition, we introduce a real-world dataset for scene understanding, including photo-realistic panoramas, high-fidelity depth images, accurately annotated room layouts, and oriented object bounding boxes and shapes.
arXiv Detail & Related papers (2023-05-21T16:20:57Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes [75.20435924081585]
JPerceiver can simultaneously estimate scale-aware depth and VO as well as BEV layout from a monocular video sequence.
It exploits the cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO.
Experiments on Argoverse, nuScenes and KITTI show the superiority of JPerceiver over existing methods on all three of these tasks.
arXiv Detail & Related papers (2022-07-16T10:33:59Z)
- Neural Rendering in a Room: Amodal 3D Understanding and Free-Viewpoint Rendering for the Closed Scene Composed of Pre-Captured Objects [40.59508249969956]
We present a novel solution to mimic such human perception capability based on a new paradigm of amodal 3D scene understanding with neural rendering for a closed scene.
We first learn the prior knowledge of the objects in a closed scene via an offline stage, which facilitates an online stage to understand the room with unseen furniture arrangement.
During the online stage, given a panoramic image of the scene in different layouts, we utilize a holistic neural-rendering-based optimization framework to efficiently estimate the correct 3D scene layout and deliver realistic free-viewpoint rendering.
arXiv Detail & Related papers (2022-05-05T15:34:09Z)
- Self-supervised 360$^{\circ}$ Room Layout Estimation [20.062713286961326]
We present the first self-supervised method to train panoramic room layout estimation models without any labeled data.
Our approach also shows promising solutions in data-scarce scenarios and active learning, which would have an immediate value in real estate virtual tour software.
arXiv Detail & Related papers (2022-03-30T04:58:07Z)
- DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization [66.25948693095604]
We propose a novel method for panoramic 3D scene understanding which recovers the 3D room layout and the shape, pose, position, and semantic category for each object from a single full-view panorama image.
Experiments demonstrate that our method outperforms existing methods on panoramic scene understanding in terms of both geometry accuracy and object arrangement.
arXiv Detail & Related papers (2021-08-24T13:55:29Z)
- AutoLay: Benchmarking amodal layout estimation for autonomous driving [18.152206533685412]
AutoLay is a dataset and benchmark for amodal layout estimation from monocular images.
In addition to fine-grained attributes such as lanes, sidewalks, and vehicles, we also provide semantically annotated 3D point clouds.
arXiv Detail & Related papers (2021-08-20T08:21:11Z)
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [100.93808824091258]
We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras.
Our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a bird's-eye-view grid.
We show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
arXiv Detail & Related papers (2020-08-13T06:29:01Z)
- Free View Synthesis [100.86844680362196]
We present a method for novel view synthesis from input images that are freely distributed around a scene.
Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts.
arXiv Detail & Related papers (2020-08-12T18:16:08Z)
- Single-View View Synthesis with Multiplane Images [64.46556656209769]
Recent work applies deep learning to generate multiplane images given two or more input images at known viewpoints.
Our method instead learns to predict a multiplane image directly from a single image input.
It additionally generates reasonable depth maps and fills in content behind the edges of foreground objects in background layers.
arXiv Detail & Related papers (2020-04-23T17:59:19Z)
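
The "lift"-then-"splat" pooling summarized in the Lift, Splat, Shoot entry above lends itself to a short sketch. This is a simplified stand-in rather than that paper's implementation: camera geometry is reduced to a known per-pixel ray direction in the ego frame, and the depth bins, BEV extents, and tensor sizes are assumptions chosen only for illustration.

```python
# A simplified stand-in, not the Lift-Splat-Shoot implementation: each pixel's
# feature is "lifted" into a frustum of candidate 3D points (one per depth bin,
# weighted by a predicted depth distribution), and all points are "splatted"
# (sum-pooled) into a bird's-eye-view grid. Camera geometry is reduced to a
# known per-pixel ray direction; depth bins and grid extents are assumptions.
import torch
import torch.nn.functional as F


def lift_splat(feat, depth_logits, rays,
               x_range=(-50.0, 50.0), z_range=(0.0, 100.0), grid_size=200):
    """feat: (C, H, W) image features; depth_logits: (D, H, W) per-pixel depth
    scores; rays: (3, H, W) unit ray directions in the ego frame."""
    C, H, W = feat.shape
    D = depth_logits.shape[0]
    depth_bins = torch.linspace(4.0, 60.0, D)                 # candidate depths (m)

    # Lift: outer product of features and depth probabilities -> frustum features.
    depth_prob = depth_logits.softmax(dim=0)                  # (D, H, W)
    frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(0)     # (D, C, H, W)

    # 3D location of every (depth bin, pixel) point along its viewing ray.
    points = rays.unsqueeze(0) * depth_bins.view(D, 1, 1, 1)  # (D, 3, H, W)

    # Splat: map x/z coordinates to BEV cell indices and sum-pool the features.
    xs = (points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * grid_size
    zs = (points[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * grid_size
    xs = xs.long().clamp(0, grid_size - 1)
    zs = zs.long().clamp(0, grid_size - 1)

    bev = torch.zeros(C, grid_size, grid_size)
    cell = (zs * grid_size + xs).reshape(-1)                  # flat cell index per point
    flat_feat = frustum.permute(1, 0, 2, 3).reshape(C, -1)    # (C, D*H*W)
    bev.view(C, -1).index_add_(1, cell, flat_feat)            # scatter-add into the grid
    return bev                                                # (C, grid_size, grid_size)


# Toy usage with random tensors standing in for one camera's features and rays.
bev = lift_splat(torch.randn(64, 8, 22),
                 torch.randn(41, 8, 22),
                 F.normalize(torch.randn(3, 8, 22), dim=0))
```

Because the splat step is a plain scatter-add into a shared grid, frustums from any number of cameras can be accumulated into the same bird's-eye-view representation, which is the rig-agnostic property the entry above highlights.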