BEVControl: Accurately Controlling Street-view Elements with
Multi-perspective Consistency via BEV Sketch Layout
- URL: http://arxiv.org/abs/2308.01661v4
- Date: Sat, 23 Sep 2023 06:59:18 GMT
- Title: BEVControl: Accurately Controlling Street-view Elements with
Multi-perspective Consistency via BEV Sketch Layout
- Authors: Kairui Yang, Enhui Ma, Jibin Peng, Qing Guo, Di Lin, Kaicheng Yu
- Abstract summary: We propose a two-stage generative method, dubbed BEVControl, that can generate accurate foreground and background contents.
Our experiments show that our BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin.
- Score: 17.389444754562252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using synthesized images to boost the performance of perception models is a
long-standing research challenge in computer vision. It becomes more pressing in
visual-centric autonomous driving systems with multi-view cameras as some
long-tail scenarios can never be collected. Guided by the BEV segmentation
layouts, the existing generative networks seem to synthesize photo-realistic
street-view images when evaluated solely on scene-level metrics. However, once zoomed in, they usually fail to produce accurate foreground and background details such as object heading. To this end, we propose a two-stage generative method,
dubbed BEVControl, that can generate accurate foreground and background
contents. Besides segmentation-like input, it also supports sketch-style input, which is more flexible for humans to edit. In addition, we propose a
comprehensive multi-level evaluation protocol to fairly compare the quality of
the generated scene, foreground object, and background geometry. Our extensive
experiments show that our BEVControl surpasses the state-of-the-art method,
BEVGen, by a significant margin, from 5.89 to 26.80 on foreground segmentation
mIoU. Furthermore, we show that training the downstream perception model with images generated by BEVControl yields an average improvement of 1.29 in NDS score.
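The multi-level evaluation protocol is only summarized above. As a rough illustration of its foreground level, one can segment the generated street views with a pretrained perception model and score the prediction against the conditioning layout via mean IoU over foreground classes. The sketch below is an assumption-laden outline, not the paper's pipeline; `bev_segmenter`, the class indexing, and all tensor conventions are hypothetical.

```python
import numpy as np

def foreground_miou(pred, gt, num_classes):
    """Mean IoU over foreground classes between predicted and
    ground-truth BEV segmentation maps (integer class maps,
    class 0 treated as background)."""
    ious = []
    for c in range(1, num_classes):          # skip background
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Hypothetical usage: run a pretrained segmenter on the generated
# multi-view images, then compare against the conditioning BEV layout.
# pred = bev_segmenter(generated_views)     # (H, W) integer class map
# score = foreground_miou(pred, gt_layout, num_classes=10)
```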
Related papers
- OneBEV: Using One Panoramic Image for Bird's-Eye-View Semantic Mapping [25.801868221496473]
OneBEV is a novel BEV semantic mapping approach using merely a single panoramic image as input.
A distortion-aware module termed Mamba View Transformation (MVT) is specifically designed to handle the spatial distortions in panoramas.
This work advances BEV semantic mapping in autonomous driving, paving the way for more advanced and reliable autonomous systems.
arXiv Detail & Related papers (2024-09-20T21:33:53Z)
- MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability [17.995042743704442]
MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design.
Our method is capable of generating high-resolution photorealistic images from text descriptions with thousands of training samples.
arXiv Detail & Related papers (2024-07-28T11:39:40Z)
- CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow [20.550935390111686]
We introduce CLIP-BEVFormer, a novel approach to enhance the multi-view image-derived BEV backbones with ground truth information flow.
We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA.
arXiv Detail & Related papers (2024-03-13T19:21:03Z)
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space.
Unsupervised domain adaptive BEV, which learns effectively from various unlabelled target data, remains far under-explored.
We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z)
- FB-BEV: BEV Representation from Forward-Backward View Transformations [131.11787050205697]
We propose a novel View Transformation Module (VTM) for Bird's-Eye-View (BEV) representation.
We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set.
arXiv Detail & Related papers (2023-08-04T10:26:55Z)
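FB-BEV's VTM pairs a forward (push) projection with a backward (pull) one; the exact module is not reproduced in this summary. As a minimal sketch of the forward direction under simplifying assumptions (single camera, discrete depth bins, no height pooling), a lift-splat-style transformation lifts every pixel along depth hypotheses, weights its feature by a predicted depth distribution, and splats the result onto the BEV grid. All argument names and shapes are illustrative, not the paper's API.

```python
import torch

def forward_splat(img_feat, depth_prob, depth_bins, K_inv, T_ego_from_cam,
                  bev_min, bev_res, bev_size):
    """Forward (push) view transformation, lift-splat style."""
    C, H, W = img_feat.shape                     # per-camera features
    D = depth_bins.numel()                       # number of depth bins
    # Unit-depth camera rays for every pixel: (3, H*W).
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    rays = K_inv @ torch.stack([u, v, torch.ones_like(u)]).reshape(3, -1)
    # Ego-frame 3D points for every (depth, pixel) pair: (D, 3, H*W).
    pts = rays.unsqueeze(0) * depth_bins.view(D, 1, 1)
    pts = torch.einsum("ij,djn->din", T_ego_from_cam[:3, :3], pts) \
        + T_ego_from_cam[:3, 3].view(1, 3, 1)
    # Depth-weighted features, flattened to (C, D*H*W).
    feat = img_feat.reshape(1, C, H * W) * depth_prob.reshape(D, 1, H * W)
    feat = feat.permute(1, 0, 2).reshape(C, -1)
    # BEV cell index of every lifted point; splat by summation.
    ix = ((pts[:, 0] - bev_min[0]) / bev_res).long().reshape(-1)
    iy = ((pts[:, 1] - bev_min[1]) / bev_res).long().reshape(-1)
    ok = (ix >= 0) & (ix < bev_size) & (iy >= 0) & (iy < bev_size)
    bev = torch.zeros(C, bev_size * bev_size)
    bev.index_add_(1, (iy * bev_size + ix)[ok], feat[:, ok])
    return bev.reshape(C, bev_size, bev_size)
```

The backward direction would instead start from BEV cells and pull image features toward them, as in the query-based sampling sketched under BEVSegFormer below.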
- From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration [20.733451121484993]
We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration.
This is a very challenging problem since its only input is several RGB images from different first-person views (FPVs) for a multi-person scene.
We propose an end-to-end framework to solve this problem, whose main idea can be divided into the following parts.
arXiv Detail & Related papers (2022-12-19T08:31:08Z)
- BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision [101.36648828734646]
We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones.
The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.
arXiv Detail & Related papers (2022-11-18T18:59:48Z)
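Perspective supervision can be read as attaching an auxiliary detection head to the per-view image features, so the backbone receives direct dense supervision alongside the BEV objective. The wrapper below is a hypothetical sketch of that joint training objective, not BEVFormer v2's architecture; `backbone`, `bev_head`, and `persp_head` are assumed stand-ins that each return a loss.

```python
import torch.nn as nn

class PerspectiveSupervisedDetector(nn.Module):
    """Sketch: joint BEV + auxiliary perspective-view objective."""
    def __init__(self, backbone, bev_head, persp_head, lam=1.0):
        super().__init__()
        self.backbone, self.bev_head = backbone, bev_head
        self.persp_head, self.lam = persp_head, lam

    def forward(self, images, bev_targets, persp_targets):
        feats = self.backbone(images)                  # per-view features
        bev_loss = self.bev_head(feats, bev_targets)   # main BEV objective
        persp_loss = self.persp_head(feats, persp_targets)  # auxiliary head
        return bev_loss + self.lam * persp_loss        # joint loss
```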
- Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe [115.31507979199564]
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia.
As sensor configurations grow more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important.
The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground truth annotations in the BEV grid; and (c) how to adapt and generalize algorithms as sensor configurations vary across different scenarios.
arXiv Detail & Related papers (2022-09-12T15:29:13Z)
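For problem (b), acquiring ground truth in the BEV grid usually amounts to rasterizing annotated 3D boxes into a top-down grid. Below is a minimal reference sketch, assuming boxes arrive as BEV centres, footprint sizes, and yaw angles (the argument conventions are assumptions for illustration).

```python
import numpy as np

def boxes_to_bev_mask(centers, sizes, yaws, bev_min, bev_res, grid):
    """Rasterize box footprints into a BEV occupancy mask by testing
    each cell centre against every rotated box (O(grid^2 * boxes),
    written for clarity rather than speed)."""
    xs = bev_min[0] + (np.arange(grid) + 0.5) * bev_res
    ys = bev_min[1] + (np.arange(grid) + 0.5) * bev_res
    gx, gy = np.meshgrid(xs, ys)                 # cell centres, (grid, grid)
    mask = np.zeros((grid, grid), dtype=bool)
    for (cx, cy), (length, width), yaw in zip(centers, sizes, yaws):
        # Rotate cell centres into the box's local frame.
        dx, dy = gx - cx, gy - cy
        lx = np.cos(yaw) * dx + np.sin(yaw) * dy
        ly = -np.sin(yaw) * dx + np.cos(yaw) * dy
        mask |= (np.abs(lx) <= length / 2) & (np.abs(ly) <= width / 2)
    return mask
```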
- GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation [105.19949897812494]
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving.
We present a novel two-stage Geometry Prior-based Transformation framework named GitNet.
arXiv Detail & Related papers (2022-04-16T06:46:45Z)
- BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs [3.5728676902207988]
We present an effective transformer-based method for BEV semantic segmentation from arbitrary camera rigs.
Specifically, our method first encodes image features from arbitrary cameras with a shared backbone.
An efficient multi-camera deformable attention unit is designed to carry out the BEV-to-image view transformation.
arXiv Detail & Related papers (2022-03-08T12:39:51Z)
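The paper's multi-camera deformable attention unit is more elaborate than this summary allows; the module below sketches only its core sampling idea for one camera and one head: each BEV query predicts a few sampling offsets and weights around its projected reference point, then aggregates bilinearly sampled image features. Module and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableBEVSampling(nn.Module):
    """Sketch of BEV-to-image deformable sampling (single camera/head)."""
    def __init__(self, dim, num_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, num_points * 2)  # per-query offsets
        self.weights = nn.Linear(dim, num_points)      # per-point weights
        self.num_points = num_points

    def forward(self, queries, ref_uv, img_feat):
        # queries: (N, dim); ref_uv: (N, 2), (x, y) in [0, 1], the image-plane
        # projection of each BEV cell; img_feat: (C, H, W).
        N, K = queries.shape[0], self.num_points
        off = self.offsets(queries).view(N, K, 2) * 0.05   # keep offsets small
        w = self.weights(queries).softmax(dim=-1)          # (N, K)
        loc = (ref_uv[:, None] + off).clamp(0, 1) * 2 - 1  # to [-1, 1]
        sampled = F.grid_sample(img_feat[None], loc[None],
                                align_corners=True)        # (1, C, N, K)
        return (sampled[0] * w[None]).sum(-1).T            # (N, C)
```

Restricting each query to a handful of sampled locations keeps the BEV-to-image transformation cheap compared with dense cross-attention over all pixels.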
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
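A spatial pyramid reduction module in this spirit can be approximated by parallel dilated convolutions that downsample while gathering multi-scale context, with the fused result flattened into tokens. The cell below is a simplified sketch under that reading, not ViTAE's exact design.

```python
import torch
import torch.nn as nn

class PyramidReductionCell(nn.Module):
    """Sketch: downsample with parallel dilated convs, emit tokens."""
    def __init__(self, in_ch, embed_dim, dilations=(1, 2, 3)):
        super().__init__()
        # stride-2 branches with matching output size across dilations
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=2,
                      padding=d, dilation=d)
            for d in dilations])
        self.proj = nn.Linear(embed_dim * len(dilations), embed_dim)

    def forward(self, x):                          # x: (B, in_ch, H, W)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H'*W', 3*embed_dim)
        return self.proj(tokens)                   # (B, num_tokens, embed_dim)
```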