Related papers: Video Generation with Learned Action Prior

Video Generation with Learned Action Prior

URL: http://arxiv.org/abs/2406.14436v1
Date: Thu, 20 Jun 2024 16:00:07 GMT
Title: Video Generation with Learned Action Prior
Authors: Meenakshi Sarkar, Devansh Bhardwaj, Debasish Ghose,
Abstract summary: Video generation is particularly challenging when the camera is mounted on a moving platform as camera motion interacts with image pixels. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution by considering camera or action as part of the observed image state, modelling both image state and action within a multi-AP learning framework.
Score: 1.740992908651449
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Stochastic video generation is particularly challenging when the camera is mounted on a moving platform, as camera motion interacts with observed image pixels, creating complex spatio-temporal dynamics and making the problem partially observable. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution by considering camera motion or action as part of the observed image state, modelling both image and action within a multi-modal learning framework. We introduce three models: Video Generation with Learning Action Prior (VG-LeAP) treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between action and the observed image frame at time $t$, learning an action prior conditioned on the observed image states; and RAFI, which integrates the augmented image-action state concept into flow matching with diffusion generative processes, demonstrating that this action-conditioned image generation concept can be extended to other diffusion-based models. We emphasize the importance of multi-modal training in partially observable video generation problems through detailed empirical studies on our new video action dataset, RoAM.

Related papers

Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects. Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities. We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z)
Learning semantical dynamics and spatiotemporal collaboration for human pose estimation in video [3.2195139886901813]
We present a novel framework that learns multi-level semantical dynamics and multi-frame human pose estimation. Specifically, we first design a multi-masked context and pose reconstruction strategy. This strategy stimulates the model to explore multi-temporal semantic relationships among frames by progressively masking the features of optical (patch) cubes and frames.
arXiv Detail & Related papers (2025-02-15T00:35:34Z)
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior to video generators. VideoJAM achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities [56.5742116979914]
CustomCrafter preserves the model's motion generation and conceptual combination abilities without additional video and fine-tuning to recovery. For motion generation, we observed that VDMs tend to restore the motion of video in the early stage of denoising, while focusing on the recovery of subject details in the later stage.
arXiv Detail & Related papers (2024-08-23T17:26:06Z)
EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation [15.569467643817447]
We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. We train on real-world videos enhanced with this innovative motion depiction approach. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy.
arXiv Detail & Related papers (2024-05-26T00:53:26Z)
Action-conditioned video data improves predictability [1.9567015559455132]
Action-Conditioned Video Generation (ACVG) framework generates video sequences conditioned on the actions of robots. ACVG generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another.
arXiv Detail & Related papers (2024-04-08T12:18:01Z)
Diffusion Priors for Dynamic View Synthesis from Monocular Videos [59.42406064983643]
Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. We first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique. We distill the knowledge from the finetuned model to a 4D representations encompassing both dynamic and static Neural Radiance Fields.
arXiv Detail & Related papers (2024-01-10T23:26:41Z)
Learning Variational Motion Prior for Video-based Motion Capture [31.79649766268877]
We present a novel variational motion prior (VMP) learning approach for video-based motion capture. Our framework can effectively reduce temporal jittering and failure modes in frame-wise pose estimation. Experiments over both public datasets and in-the-wild videos have demonstrated the efficacy and generalization capability of our framework.
arXiv Detail & Related papers (2022-10-27T02:45:48Z)
End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos. We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z)
Event-based Motion Segmentation with Spatio-Temporal Graph Cuts [51.17064599766138]
We have developed a method to identify independently objects acquired with an event-based camera. The method performs on par or better than the state of the art without having to predetermine the number of expected moving objects.
arXiv Detail & Related papers (2020-12-16T04:06:02Z)
Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames. Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network. Our motion learning module is lightweight and flexible to be embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.