Pulp Motion: Framing-aware multimodal camera and human motion generation
- URL: http://arxiv.org/abs/2510.05097v1
- Date: Mon, 06 Oct 2025 17:58:34 GMT
- Title: Pulp Motion: Framing-aware multimodal camera and human motion generation
- Authors: Robin Courant, Xi Wang, David Loiseaux, Marc Christie, Vicky Kalogeiton
- Abstract summary: We are the first to cast this task as a text-conditioned joint generation, aiming to maintain consistent on-screen framing. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality. Our experiments show the generality and effectiveness of our method in generating on-frame coherent human-camera motions.
- Score: 23.011172300168642
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space. In this paper, we are the first to cast this task as a text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains in textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting a new state of the art for this task. Code, models, and data are available on our project page: https://www.lix.polytechnique.fr/vista/projects/2025_pulpmotion_courant/
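To make the auxiliary "framing" modality concrete, below is a minimal sketch (not the authors' code) of the geometric building block the abstract names: projecting 3D human joints through the camera to obtain their on-screen 2D positions. The simple pinhole model and all names (project_joints, the intrinsics values) are illustrative assumptions.

```python
# Minimal sketch, assuming a simple pinhole camera: project 3D human joints
# into screen space to obtain the "on-screen framing" auxiliary modality.
# Function and variable names are illustrative, not the paper's API.
import numpy as np

def project_joints(joints_world, R, t, fx, fy, cx, cy):
    """Project (J, 3) world-space joints onto the image plane.

    R (3, 3), t (3,): world-to-camera rotation and translation.
    fx, fy, cx, cy: pinhole intrinsics (focal lengths, principal point).
    Returns (J, 2) pixel coordinates, i.e. the on-screen framing.
    """
    cam = joints_world @ R.T + t      # world -> camera coordinates
    uv = cam[:, :2] / cam[:, 2:3]     # perspective division (assumes z > 0)
    uv[:, 0] = fx * uv[:, 0] + cx     # apply intrinsics
    uv[:, 1] = fy * uv[:, 1] + cy
    return uv

# Toy usage: 22 joints roughly 5 m in front of an identity camera.
rng = np.random.default_rng(0)
joints = rng.normal(scale=0.5, size=(22, 3)) + np.array([0.0, 0.0, 5.0])
framing = project_joints(joints, np.eye(3), np.zeros(3),
                         1000.0, 1000.0, 512.0, 288.0)
print(framing.shape)  # (22, 2)
```

In the paper's framework this projection only defines the auxiliary modality; coherence is then enforced in latent space, schematically z_framing ≈ W [z_human; z_camera] for a learned linear map W, which auxiliary sampling exploits to steer generation toward a coherent framing.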
Related papers
- EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion. We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens. Our findings reveal that explicitly representing human motion alongside appearance significantly boosts the coherence and plausibility of human-centric video generation.
arXiv Detail & Related papers (2025-12-21T17:08:14Z)
- PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence [67.78835640962167]
Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. We propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters. We present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering.
arXiv Detail & Related papers (2025-12-15T16:03:26Z)
- UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework. We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation [7.900728371180723]
We present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion. Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask. Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.
arXiv Detail & Related papers (2025-04-11T00:41:25Z)
- GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography [98.28272367169465]
We introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. Thanks to the comprehensive and diverse database, we train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation. Experiments demonstrate that, compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability.
arXiv Detail & Related papers (2025-04-09T17:56:01Z)
- HumanMM: Global Human Motion Recovery from Multi-shot Videos [24.273414172013933]
We present a novel framework designed to reconstruct long-sequence 3D human motion in world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable for applications such as motion generation and motion understanding.
arXiv Detail & Related papers (2025-03-10T17:57:03Z)
- Learning semantical dynamics and spatiotemporal collaboration for human pose estimation in video [3.2195139886901813]
We present a novel framework that learns multi-level semantic dynamics for multi-frame human pose estimation. Specifically, we first design a multi-masked context and pose reconstruction strategy. This strategy stimulates the model to explore multi-temporal semantic relationships among frames by progressively masking the features of optical (patch) cubes and frames.
arXiv Detail & Related papers (2025-02-15T00:35:34Z)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. We propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage conditioned on this representation. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline requires only low-frame-rate videos and unpaired human motion data; no high-frame-rate videos are needed for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.