ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models
- URL: http://arxiv.org/abs/2508.05236v1
- Date: Thu, 07 Aug 2025 10:24:47 GMT
- Authors: Yatong Lan, Jingfeng Chen, Yiru Wang, Lei He
- Abstract summary: ArbiViewGen is a novel framework for generating controllable camera images from arbitrary points of view. We introduce two key components: Feature-Aware Adaptive View Stitching and Cross-View Consistency Self-Supervised Learning.
- Score: 8.314980817044958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose ArbiViewGen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm in which the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, ArbiViewGen is the first method capable of controllable arbitrary-view camera image generation in multiple vehicle configurations.
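The fine-grained alignment and clustering stages of FAVS can be illustrated with a minimal sketch. The paper does not publish its matching algorithm, so everything below is an assumption: mutual-nearest-neighbour descriptor matching with a ratio test stands in for the "improved feature matching", and a median-displacement consensus filter stands in for the clustering of high-confidence regions; the coarse pose-based stage is omitted. All function and variable names are hypothetical.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Fine-grained stage (hypothetical): mutual nearest neighbours
    between two descriptor sets, kept only if they pass a ratio test."""
    # Pairwise Euclidean distances between all descriptor pairs.
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)  # per-row nearest neighbours in desc_b
    matches = []
    for i in range(len(desc_a)):
        j, j2 = nn[i, 0], nn[i, 1]
        # Ratio test: best match must be clearly better than second best,
        # and the match must be mutual (i is also j's nearest neighbour).
        if d[i, j] < ratio * d[i, j2] and np.argmin(d[:, j]) == i:
            matches.append((i, j))
    return matches

def filter_by_displacement_consensus(kp_a, kp_b, matches, tol=2.0):
    """Stand-in for the clustering stage: keep only matches whose
    keypoint displacement agrees with the dominant (median) displacement,
    i.e. the largest geometrically consistent cluster."""
    if not matches:
        return []
    disp = np.array([kp_b[j] - kp_a[i] for i, j in matches])
    med = np.median(disp, axis=0)
    keep = np.linalg.norm(disp - med, axis=1) < tol
    return [m for m, k in zip(matches, keep) if k]

# Toy example: five keypoints shifted by (5, 0), one corrupted outlier.
kp_a = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]], dtype=float)
kp_b = kp_a + np.array([5.0, 0.0])
kp_b[4] = [100.0, 100.0]          # geometric outlier
desc_a = np.eye(5)                 # identical descriptors => all pairs match
desc_b = np.eye(5)

raw = match_descriptors(desc_a, desc_b)
good = filter_by_displacement_consensus(kp_a, kp_b, raw)
```

On the toy data, all five descriptor pairs match, and the consensus filter then discards the pair whose displacement disagrees with the dominant (5, 0) shift, leaving four high-confidence correspondences that a stitching stage could rely on.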
Related papers
- DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving [9.882070476776274]
We present DriveCamSim, a generalizable camera simulation framework. Our core innovation lies in the proposed Explicit Camera Modeling mechanism. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines.
arXiv Detail & Related papers (2025-05-26T08:50:15Z) - SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification [61.753607285860944]
We propose a novel two-stage feature learning framework named SD-ReID for AG-ReID. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. In the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions.
arXiv Detail & Related papers (2025-04-13T12:44:50Z) - Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes [3.2416801263793285]
We propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework consisting of an encoder-decoder model and a self-supervised pretext task. Our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches.
arXiv Detail & Related papers (2025-03-17T21:48:56Z) - HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation [106.09886920774002]
We present a hybrid-view-based knowledge distillation framework, termed HVDistill, to guide the feature learning of a point cloud neural network.
Our method achieves consistent improvements over the baseline trained from scratch and significantly outperforms the existing schemes.
arXiv Detail & Related papers (2024-03-18T14:18:08Z) - Multi-View Unsupervised Image Generation with Cross Attention Guidance [23.07929124170851]
This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets.
We identify object poses by clustering the dataset through comparing visibility and locations of specific object parts.
Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
arXiv Detail & Related papers (2023-12-07T14:55:13Z) - Cross-View Cross-Scene Multi-View Crowd Counting [56.83882084112913]
Multi-view crowd counting has been previously proposed to utilize multi-cameras to extend the field-of-view of a single camera.
We propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where the training and testing occur on different scenes with arbitrary camera layouts.
arXiv Detail & Related papers (2022-05-03T15:03:44Z) - Camera-Conditioned Stable Feature Generation for Isolated Camera Supervised Person Re-IDentification [24.63519986072777]
Cross-camera images could be unavailable under the ISolated Camera Supervised setting, e.g., a surveillance system deployed across distant scenes.
A new pipeline is introduced by synthesizing the cross-camera samples in the feature space for model training.
Experiments on two ISCS person Re-ID datasets demonstrate the superiority of our CCSFG to the competitors.
arXiv Detail & Related papers (2022-03-29T03:10:24Z) - Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network [92.01145655155374]
We present an unsupervised image enhancement generative network (UEGAN).
It learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner.
Results show that the proposed model effectively improves the aesthetic quality of images.
arXiv Detail & Related papers (2020-12-30T03:22:46Z) - Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner.
We show that our approach performs competitively to fully-supervised approaches for several object categories like human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z) - Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach [104.02201472370801]
We come up with a novel image coding framework by leveraging both the compressive and the generative models.
By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels.
Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection.
arXiv Detail & Related papers (2020-01-09T10:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.