FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
- URL: http://arxiv.org/abs/2505.03730v1
- Date: Tue, 06 May 2025 17:58:02 GMT
- Title: FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
- Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang
- Abstract summary: Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure. We propose FlexiAct, which transfers actions from a reference video to an arbitrary target image.
- Score: 49.09128364751743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to varying degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, performs action extraction directly within the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/
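The abstract describes FAE only at a high level. As a rough illustration of the frequency-aware idea, the PyTorch sketch below splits a video latent into low- and high-frequency bands and weights a denoising loss by timestep, so early (noisy) steps emphasize low-frequency motion and late steps emphasize high-frequency appearance. All function names, the radial cutoff, and the linear schedule are illustrative assumptions, not FlexiAct's released implementation.

```python
import torch
import torch.fft

def frequency_split(latent: torch.Tensor, cutoff: float = 0.25):
    """Split a (B, C, T, H, W) latent into low/high-frequency parts
    using a radial mask in the 2D spatial Fourier domain.
    `cutoff` is an illustrative normalized radius, not a paper value."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    H, W = latent.shape[-2:]
    fy = torch.linspace(-0.5, 0.5, H, device=latent.device)
    fx = torch.linspace(-0.5, 0.5, W, device=latent.device)
    radius = torch.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    low_mask = (radius <= cutoff).to(latent.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1))).real
    return low, latent - low

def timestep_weights(t: torch.Tensor, num_steps: int = 1000):
    """Hypothetical linear schedule: large t (early, noisy denoising)
    weights the low-frequency motion band; small t weights the
    high-frequency appearance band."""
    w_low = t.float() / num_steps
    return w_low, 1.0 - w_low

def frequency_aware_objective(pred: torch.Tensor, target: torch.Tensor, t: torch.Tensor):
    """Weight the per-band reconstruction error by timestep, so
    supervision shifts from motion to appearance as denoising proceeds."""
    pred_low, pred_high = frequency_split(pred)
    tgt_low, tgt_high = frequency_split(target)
    w_low, w_high = timestep_weights(t)
    w_low = w_low.view(-1, 1, 1, 1, 1)   # broadcast over (B, C, T, H, W)
    w_high = w_high.view(-1, 1, 1, 1, 1)
    loss = w_low * (pred_low - tgt_low) ** 2 + w_high * (pred_high - tgt_high) ** 2
    return loss.mean()
```

The paper states that FAE extracts actions inside the denoising process itself rather than through a separate loss or spatial-temporal architecture, so treat this purely as a sketch of the timestep-dependent low/high-frequency trade-off the abstract describes.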
Related papers
- Olaf-World: Orienting Latent Actions for Video World Modeling [100.96069208914957]
Scaling action-controllable world models is limited by the scarcity of action labels. We present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video.
arXiv Detail & Related papers (2026-02-10T18:58:41Z)
- One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer [36.26551019954542]
We present One-to-All Animation, a framework for high-fidelity character animation and image pose transfer. To handle spatially misaligned references, we reformulate training as a self-supervised outpainting task. We also design a reference extractor for comprehensive identity feature extraction.
arXiv Detail & Related papers (2025-11-28T07:30:10Z)
- Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising [23.044483059783143]
Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation.
arXiv Detail & Related papers (2025-11-09T22:47:50Z)
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z)
- ATI: Any Trajectory Instruction for Controllable Video Generation [25.249489701215467]
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion. Our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models.
arXiv Detail & Related papers (2025-05-28T23:49:18Z)
- Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects. Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities. We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z)
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition [33.68311764817763]
We propose an Actionlet-Dependent Contrastive Learning method (ActCLR).
The actionlet, defined as the discriminative subset of the human skeleton, effectively decomposes motion regions for better action modeling.
Different data transformations are applied to actionlet and non-actionlet regions to introduce more diversity while maintaining their own characteristics.
arXiv Detail & Related papers (2023-03-20T06:47:59Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects via a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Event-based Motion Segmentation with Spatio-Temporal Graph Cuts [51.17064599766138]
We have developed a method to identify independently moving objects in data acquired with an event-based camera.
The method performs on par with or better than the state of the art without having to predetermine the number of expected moving objects.
arXiv Detail & Related papers (2020-12-16T04:06:02Z)
- Learning to Manipulate Individual Objects in an Image [71.55005356240761]
We describe a method to train a generative model with latent factors that are independent and localized.
This means that perturbing the latent variables affects only local regions of the synthesized image, corresponding to objects.
Unlike other unsupervised generative models, ours enables object-centric manipulation, without requiring object-level annotations.
arXiv Detail & Related papers (2020-04-11T21:50:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.