PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth
- URL: http://arxiv.org/abs/2505.01729v2
- Date: Fri, 18 Jul 2025 07:43:15 GMT
- Title: PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth
- Authors: Bu Jin, Weize Li, Baihan Yang, Zhenxin Zhu, Junpeng Jiang, Huan-ang Gao, Haiyang Sun, Kun Zhan, Hengtong Hu, Xueyang Zhang, Peng Jia, Hao Zhao
- Abstract summary: We introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. Experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
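The abstract's central mechanism is pose-aware frame warping under a photometric loss. As a rough illustration only (this is not the authors' code; the pinhole intrinsics, world-to-camera pose convention, nearest-neighbour sampling, and plain L1 photometric term are all simplifying assumptions), the warping-and-loss step could be sketched as:

```python
import numpy as np

def warp_frame(src, depth, K, T, eps=1e-6):
    """Warp source image `src` (H,W) into the target view.

    `depth` (H,W): per-pixel depth of the *target* frame.
    `K` (3,3):     camera intrinsics (assumed shared by both views).
    `T` (4,4):     relative pose mapping target-camera points into the
                   source camera (an assumed convention).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    # Back-project target pixels to 3D, move them into the source frame.
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src_cam = (T @ cam_h)[:3]
    # Re-project into the source image plane (nearest-neighbour sampling
    # here; real systems use differentiable bilinear sampling).
    proj = K @ src_cam
    z = np.clip(proj[2], eps, None)
    us = np.round(proj[0] / z).astype(int)
    vs = np.round(proj[1] / z).astype(int)
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    out = np.zeros_like(src)
    out.reshape(-1)[valid] = src[vs[valid], us[valid]]
    return out, valid.reshape(H, W)

def photometric_loss(target, warped, valid):
    """Mean absolute intensity difference over in-bounds pixels only."""
    return np.abs(target - warped)[valid].mean()
```

With the identity relative pose and a constant depth, the warp reduces to the identity and the loss is zero, which is the geometric-consistency signal the paper's training objective exploits.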
Related papers
- Walk through Paintings: Egocentric World Models from Internet Priors [65.30611174953958]
We present the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids.
arXiv Detail & Related papers (2026-01-21T18:59:32Z) - PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence [67.78835640962167]
Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. We propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters. We present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering.
arXiv Detail & Related papers (2025-12-15T16:03:26Z) - Generative Photographic Control for Scene-Consistent Video Cinematic Editing [75.45726688666083]
We propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters. We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs. Our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.
arXiv Detail & Related papers (2025-11-17T03:17:23Z) - PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control [67.17998939712326]
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics.
arXiv Detail & Related papers (2025-09-29T10:55:48Z) - GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control [50.67481583744243]
We introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models. We propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Our method significantly outperforms existing models in both action accuracy and 3D spatial awareness.
arXiv Detail & Related papers (2025-05-28T14:46:51Z) - DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving [9.882070476776274]
We present DriveCamSim, a generalizable camera simulation framework. Our core innovation lies in the proposed Explicit Camera Modeling mechanism. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines.
arXiv Detail & Related papers (2025-05-26T08:50:15Z) - Dynamic Camera Poses and Where to Find Them [36.249380390918816]
We introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction [78.27956235915622]
Traditional SLAM systems struggle with highly dynamic scenes commonly found in casual videos. This work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. Our framework combines the core of traditional SLAM, bundle adjustment, with a robust learning-based 3D tracker front-end.
arXiv Detail & Related papers (2025-04-20T07:29:42Z) - FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings. We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z) - CamI2V: Camera-Controlled Image-to-Video Diffusion Model [11.762824216082508]
Integrated camera pose is a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. We identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition.
arXiv Detail & Related papers (2024-10-21T12:36:27Z) - UniDrive: Towards Universal Driving Perception Across Camera Configurations [38.40168936403638]
3D perception aims to infer 3D information from 2D images based on 3D-2D projection. Generalizing across camera configurations is important for deploying autonomous driving models on different car models. We present UniDrive, a novel framework for vision-centric autonomous driving to achieve universal perception across camera configurations.
arXiv Detail & Related papers (2024-10-17T17:59:59Z) - VICAN: Very Efficient Calibration Algorithm for Large Camera Networks [49.17165360280794]
We introduce a novel methodology that extends Pose Graph Optimization techniques.
We consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step.
Our framework retains compatibility with traditional PGO solvers, but its efficacy benefits from a custom-tailored optimization scheme.
arXiv Detail & Related papers (2024-03-25T17:47:03Z) - Robust Self-Supervised Extrinsic Self-Calibration [25.727912226753247]
Multi-camera self-supervised monocular depth estimation from videos is a promising way to reason about the environment.
We introduce a novel method for extrinsic calibration that builds upon the principles of self-supervised monocular depth and ego-motion learning.
arXiv Detail & Related papers (2023-08-04T06:20:20Z) - SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves state-of-the-art performance on challenging multi-camera depth estimation datasets.
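The cross-view transformer above is described only at a high level. A generic single-head cross-attention between one camera's features (queries) and the other cameras' features (keys/values) captures the idea; all shapes and function names here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(queries, keys, values):
    """Fuse features across camera views with scaled dot-product attention.

    queries: (Nq, D) tokens from one camera.
    keys, values: (Nk, D) tokens pooled from the other cameras.
    Returns (Nq, D) fused features.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d))  # (Nq, Nk) attention weights
    return attn @ values
```

When all keys are identical the attention weights are uniform, so each fused token is simply the mean of the other views' values, which is a useful sanity check on the implementation.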
arXiv Detail & Related papers (2022-04-07T17:58:47Z) - GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras [99.07219478953982]
We present an approach for 3D global human mesh recovery from monocular videos recorded with dynamic cameras.
We first propose a deep generative motion infiller, which autoregressively infills the body motions of occluded humans based on visible motions.
In contrast to prior work, our approach reconstructs human meshes in consistent global coordinates even with dynamic cameras.
arXiv Detail & Related papers (2021-12-02T18:59:54Z) - TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation [77.09542018140823]
We propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem.
TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes.
arXiv Detail & Related papers (2021-05-28T19:08:43Z)
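Several of the entries above (TransCamP, VICAN, PosePilot itself) revolve around inter-frame relative camera motion. Under a world-to-camera SE(3) convention, which is an assumption on our part rather than something these papers specify, relative motion between two absolute poses composes as follows:

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis by `theta` radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def se3(R, t):
    """Assemble a 4x4 rigid transform from rotation R (3,3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_i, T_j):
    """Relative motion taking camera-i coordinates to camera-j coordinates,
    assuming T_i, T_j are world-to-camera transforms: T_ij = T_j @ T_i^{-1}."""
    return T_j @ np.linalg.inv(T_i)
```

The key property is that relative motions chain: composing i→j with j→k yields i→k, which is what lets pose-graph methods stitch pairwise estimates into a consistent trajectory.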
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.