MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability
- URL: http://arxiv.org/abs/2407.19468v1
- Date: Sun, 28 Jul 2024 11:39:40 GMT
- Title: MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability
- Authors: Buyu Liu, Kai Wang, Yansong Liu, Jun Bao, Tingting Han, Jun Yu
- Abstract summary: MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design.
Our method generates high-resolution photorealistic images from text descriptions using only thousands of training samples.
- Score: 17.995042743704442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work addresses multi-view perspective RGB generation from text prompts given Bird-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, cannot handle detailed text prompts, or are incapable of generalizing to unseen viewpoints, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, allowing object-level control and novel view generation at test time. Specifically, MVPbev first projects the given BEV semantics to perspective views with camera parameters, enabling the model to generalize to unseen viewpoints. We then introduce a multi-view attention module in which special initialization and denoising processes explicitly enforce local consistency among overlapping views w.r.t. cross-view homography. Finally, MVPbev allows test-time instance-level controllability by refining a pre-trained text-to-image diffusion model. Extensive experiments on NuScenes demonstrate that our method generates high-resolution photorealistic images from text descriptions with only thousands of training samples, surpassing state-of-the-art methods under various evaluation metrics. We further demonstrate the advantages of our method in terms of generalizability and controllability with the help of novel evaluation metrics and a comprehensive human analysis. Our code, data, and model can be found at https://github.com/kkaiwwana/MVPbev.
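The first-stage geometry can be pictured with a small sketch: assuming the BEV semantic map lies on the ego vehicle's ground plane (z = 0) and per-camera intrinsics and extrinsics are known, each BEV cell can be projected into a chosen perspective view and its label painted into an image-sized map. The grid extents, image size, and function names below are illustrative placeholders, not MVPbev's released API.

```python
# A minimal sketch of the first-stage geometry, assuming the BEV semantic map
# lies on the ego vehicle's ground plane (z = 0) and that camera intrinsics
# `K` (3x3) and ego-to-camera extrinsics `T_cam_from_ego` (4x4) are known.
# Grid extents, image size, and all names are illustrative, not MVPbev's API.
import numpy as np

def project_bev_to_perspective(bev_sem, K, T_cam_from_ego,
                               x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                               img_hw=(224, 400)):
    """bev_sem: (H_bev, W_bev) integer label map on a metric ground-plane grid."""
    H_bev, W_bev = bev_sem.shape
    H, W = img_hw

    # Metric coordinates of every BEV cell centre on the z = 0 ground plane.
    xs = np.linspace(*x_range, W_bev)
    ys = np.linspace(*y_range, H_bev)
    gx, gy = np.meshgrid(xs, ys)
    pts_ego = np.stack([gx, gy, np.zeros_like(gx), np.ones_like(gx)], -1).reshape(-1, 4)

    # Ego frame -> camera frame -> image plane.
    pts_cam = (T_cam_from_ego @ pts_ego.T)[:3]              # (3, N)
    in_front = pts_cam[2] > 0.1                             # keep points in front of the camera
    uvw = K @ pts_cam
    uv = (uvw[:2] / np.clip(uvw[2], 1e-6, None)).T          # (N, 2) pixel coordinates

    # Paint BEV labels into a perspective-view semantic map.
    persp_sem = np.zeros((H, W), dtype=bev_sem.dtype)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    persp_sem[v[valid], u[valid]] = bev_sem.reshape(-1)[valid]
    return persp_sem
```

Because only camera parameters enter the projection, the same routine can be run for camera poses never seen during training, which is what the abstract refers to as test-time generalization to unseen viewpoints.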
Related papers
- GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping [47.38125925469167]
We propose a semantic-preserving generative warping framework to generate novel views from a single image.
Our approach addresses the limitations of existing methods by conditioning the generative model on source view images.
Our model outperforms existing methods in both in-domain and out-of-domain scenarios.
arXiv Detail & Related papers (2024-05-27T15:07:04Z) - FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout [17.389444754562252]
We propose a two-stage generative method, dubbed BEVControl, that can generate accurate foreground and background contents.
Our experiments show that our BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin.
arXiv Detail & Related papers (2023-08-03T09:56:31Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion [26.582847694092884]
This paper introduces MVDiffusion, a simple yet effective method for generating consistent multiview images.
MVDiffusion simultaneously generates all images with global awareness, effectively addressing the prevalent error accumulation issue.
arXiv Detail & Related papers (2023-07-03T15:19:17Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
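As a rough illustration of the cross-view segmentation consistency idea, the sketch below compares the per-pixel class distributions predicted for two spatially aligned (e.g., photometrically augmented) views of the same image; the symmetric-KL form and the alignment assumption are placeholders rather than ViewCo's exact formulation.

```python
# A rough sketch of cross-view segmentation consistency, assuming the two views
# are spatially aligned so their per-pixel class distributions can be compared
# directly. The symmetric KL form is an assumption, not ViewCo's exact loss.
import torch.nn.functional as F

def cross_view_consistency(logits_a, logits_b):
    """logits_*: (B, C, H, W) segmentation logits from two augmented views."""
    p_a = F.softmax(logits_a, dim=1)
    p_b = F.softmax(logits_b, dim=1)
    # Symmetric KL between the two per-pixel class distributions.
    return 0.5 * (F.kl_div(p_a.clamp_min(1e-8).log(), p_b, reduction="batchmean")
                  + F.kl_div(p_b.clamp_min(1e-8).log(), p_a, reduction="batchmean"))
```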
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - Few-Shot Object Detection by Knowledge Distillation Using Bag-of-Visual-Words Representations [58.48995335728938]
We design a novel knowledge distillation framework to guide the learning of the object detector.
We first present a novel Position-Aware Bag-of-Visual-Words model for learning a representative bag of visual words.
We then perform knowledge distillation based on the fact that an image should have consistent BoVW representations in two different feature spaces.
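The distillation idea can be pictured with a generic sketch: local features of the same image, taken from a teacher feature space and a student feature space, are softly assigned to a codebook of visual words and the resulting histograms are pulled together. The shared codebook and the KL term are assumptions, not the paper's Position-Aware BoVW model.

```python
# A generic sketch of BoVW-consistency distillation in PyTorch; the shared
# codebook and the KL objective are assumptions, not the paper's formulation.
import torch
import torch.nn.functional as F

def bovw_histogram(feats, codebook, tau=0.1):
    """feats: (N, D) local features; codebook: (K, D) visual words -> (K,) histogram."""
    feats = F.normalize(feats, dim=-1)
    codebook = F.normalize(codebook, dim=-1)
    assign = F.softmax(feats @ codebook.t() / tau, dim=-1)   # (N, K) soft assignments
    return assign.mean(dim=0)                                # (K,) normalized histogram

def bovw_consistency_loss(student_feats, teacher_feats, codebook):
    p_student = bovw_histogram(student_feats, codebook)
    with torch.no_grad():                                    # teacher space is frozen
        p_teacher = bovw_histogram(teacher_feats, codebook)
    # KL(teacher || student): the student is pushed to reproduce the teacher's
    # distribution over visual words for the same image.
    return F.kl_div(p_student.clamp_min(1e-8).log(), p_teacher, reduction="sum")
```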
arXiv Detail & Related papers (2022-07-25T10:40:40Z) - Generalized Multi-view Shared Subspace Learning using View Bootstrapping [43.027427742165095]
A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events in order to improve downstream learning tasks.
We present a neural method based on multi-view correlation to capture the information shared across a large number of views by subsampling them in a view-agnostic manner during training.
Experiments on spoken word recognition, 3D object classification and pose-invariant face recognition demonstrate the robustness of view bootstrapping to model a large number of views.
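A minimal sketch of such a bootstrapping step, under simplifying assumptions: a random, view-agnostic subset of the parallel views is drawn at each iteration and a shared encoder is trained to keep the embeddings it assigns to the same instance in agreement. The cosine-agreement term below stands in for the paper's multi-view correlation objective.

```python
# A minimal sketch of view bootstrapping with a shared encoder; the cosine
# agreement term is a stand-in for the paper's multi-view correlation objective.
import torch
import torch.nn.functional as F

def view_bootstrap_step(encoder, views, num_sampled=4):
    """views: (V, B, D_in) tensor holding V parallel views of a batch of B instances."""
    V = views.shape[0]
    idx = torch.randperm(V)[:num_sampled]                    # view-agnostic subsampling
    z = F.normalize(encoder(views[idx]), dim=-1)             # (num_sampled, B, D_out)

    # Average pairwise cosine similarity between the embeddings that every pair
    # of sampled views assigns to the same instance (diagonal removed).
    sims = torch.einsum("ubd,vbd->uvb", z, z).mean(-1)       # (num_sampled, num_sampled)
    off_diag = sims - torch.eye(num_sampled, device=sims.device)
    return -off_diag.sum() / (num_sampled * (num_sampled - 1))
```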
arXiv Detail & Related papers (2020-05-12T20:35:14Z) - Exploit Clues from Views: Self-Supervised and Regularized Learning for Multiview Object Recognition [66.87417785210772]
This work investigates the problem of multiview self-supervised learning (MV-SSL).
A novel surrogate task for self-supervised learning is proposed by pursuing an "object invariant" representation.
Experiments show that the recognition and retrieval results using view-invariant prototype embedding (VISPE) outperform those of other self-supervised learning methods.
arXiv Detail & Related papers (2020-03-28T07:06:06Z)