Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions
- URL: http://arxiv.org/abs/2403.07198v1
- Date: Mon, 11 Mar 2024 22:46:46 GMT
- Title: Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions
- Authors: Lan Wang, Vishnu Boddeti, and Sernam Lim
- Abstract summary: ReimaginedAct comprises video understanding, reasoning, and editing modules.
Our method can accept not only direct instructional text prompts but also `what if' questions to predict possible action changes.
- Score: 49.14827857853878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel text-to-pose video editing method, ReimaginedAct. While
existing video editing tasks are limited to changes in attributes, backgrounds,
and styles, our method aims to predict open-ended human action changes in
video. Moreover, our method can accept not only direct instructional text
prompts but also `what if' questions to predict possible action changes.
ReimaginedAct comprises video understanding, reasoning, and editing modules.
First, an LLM is used to obtain a plausible answer to the
instruction or question, which is then used for (1) prompting Grounded-SAM to
produce bounding boxes of relevant individuals and (2) retrieving a set of pose
videos that we have collected for editing human actions. The retrieved pose
videos and the detected individuals are then utilized to alter the poses
extracted from the original video. We also employ a timestep blending module to
ensure the edited video retains its original content except where modifications
are needed. To facilitate research in text-to-pose video editing,
we introduce a new evaluation dataset, WhatifVideo-1.0. This dataset includes
videos of different scenarios spanning a range of difficulty levels, along with
questions and text prompts. Experimental results demonstrate that existing
video editing methods struggle with human action editing, while our approach
can achieve effective action editing and even imaginary editing from
counterfactual questions.
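The abstract sketches a three-module pipeline: an LLM resolves the instruction or `what if' question into a target action, the answer drives Grounded-SAM localization and pose-video retrieval, and a timestep blending step preserves unedited content. The Python sketch below illustrates only that control flow; every function name and signature is a hypothetical, stubbed stand-in, not the authors' published API.

```python
# Minimal control-flow sketch of the pipeline the abstract describes.
# All names below (query_llm, ground_individuals, ...) are hypothetical
# stand-ins with stubbed bodies; the paper does not publish this API.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    frame_idx: int
    x1: float
    y1: float
    x2: float
    y2: float

def query_llm(prompt: str) -> str:
    # Reasoning module: turn an instruction or `what if' question
    # into a concrete target action (stub).
    return "person jumps over the bench"

def ground_individuals(frames: list, action: str) -> list:
    # Understanding module: prompt Grounded-SAM with the target action
    # to localize the relevant individuals (stub).
    return [BoundingBox(i, 0.1, 0.1, 0.6, 0.9) for i in range(len(frames))]

def retrieve_pose_video(action: str) -> list:
    # Retrieve the closest match from the collected pose-video bank (stub).
    return ["pose_frame"] * 16

def edit_poses(frames: list, boxes: list, pose_video: list) -> list:
    # Editing module: swap the poses extracted from the original video
    # for the retrieved target poses inside the detected boxes (stub).
    return frames

def timestep_blend(original: list, edited: list) -> list:
    # Timestep blending: keep content outside the edited regions close
    # to the original video (stub).
    return edited

def reimagine(frames: list, instruction: str) -> list:
    action = query_llm(instruction)
    boxes = ground_individuals(frames, action)
    pose_video = retrieve_pose_video(action)
    edited = edit_poses(frames, boxes, pose_video)
    return timestep_blend(frames, edited)

if __name__ == "__main__":
    frames = ["frame"] * 16
    print(len(reimagine(frames, "What if the person jumped instead?")), "frames")
```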
Related papers
- A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model [10.736207095604414]
We propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM) to extract general editing-relevant features.
We also propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions (a toy sketch of this sequential-decision formulation follows the list below).
arXiv Detail & Related papers (2024-11-07T18:20:28Z)
- RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates versatile capabilities in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
- ReVideo: Remake a Video with Motion and Content Control [67.5923127902463]
We present a novel attempt to Remake a Video (ReVideo), which allows precise video editing in specific areas through the specification of both content and motion.
ReVideo addresses a new task involving the coupling and training imbalance between content and motion control.
Our method can also seamlessly extend these applications to multi-area editing without modifying specific training, demonstrating its flexibility and robustness.
arXiv Detail & Related papers (2024-05-22T17:46:08Z)
- GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [2.362412515574206]
We propose "GenVideo" for editing videos leveraging target-image aware T2I models.
Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit.
arXiv Detail & Related papers (2024-04-18T23:25:27Z)
- Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion [19.969947635371]
Videoshop is a training-free video editing algorithm for localized semantic edits.
It allows users to add or remove objects, semantically change objects, insert stock photos into videos, and more, with fine-grained control over location and appearance.
Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks, as measured by 10 evaluation metrics.
arXiv Detail & Related papers (2024-03-21T17:59:03Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- Editing 3D Scenes via Text Prompts without Retraining [80.57814031701744]
DN2N is a text-driven editing method that allows for the direct acquisition of a NeRF model with universal editing capabilities.
Our method employs off-the-shelf text-based editing models of 2D images to modify the 3D scene images.
Our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer.
arXiv Detail & Related papers (2023-09-10T02:31:50Z)
- The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing [90.59584961661345]
This work introduces the Anatomy of Video Editing, a dataset and benchmark to foster research in AI-assisted video editing.
Our benchmark suite focuses on video editing tasks beyond visual effects, such as automatic footage organization and assisted video assembly.
To enable research on these fronts, we annotate more than 1.5M tags with concepts relevant to cinematography, from 196,176 shots sampled from movie scenes.
arXiv Detail & Related papers (2022-07-20T10:53:48Z)
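The reinforcement-learning entry above frames editing as sequential decision-making. As a purely illustrative sketch of that formulation, the toy below runs tabular Q-learning over (last shot, next shot) transitions; the scalar "feature" standing in for VLM representations and the reward are invented placeholders, not the paper's actual framework.

```python
# Toy illustration of sequential editing decisions as an MDP.
# The feature, reward, and policy definitions are invented
# placeholders, not the paper's actual framework.
import random

N_SHOTS = 5      # candidate shots to order into an edit
EPISODES = 2000
EPSILON = 0.1    # exploration rate
ALPHA = 0.2      # learning rate

# Hypothetical "VLM feature": one scalar per shot; the toy criterion
# says a good edit places shots in ascending feature order.
features = [random.random() for _ in range(N_SHOTS)]

def reward(prev_shot, next_shot):
    # +1 when the transition matches the toy editing criterion.
    if prev_shot is None:
        return 0.0
    return 1.0 if features[next_shot] >= features[prev_shot] else -1.0

# Tabular Q-values over (last shot placed, candidate shot) pairs.
Q = {}

def q(s, a):
    return Q.get((s, a), 0.0)

for _ in range(EPISODES):
    remaining = list(range(N_SHOTS))
    state = None  # last shot placed on the timeline
    while remaining:
        if random.random() < EPSILON:
            action = random.choice(remaining)
        else:
            action = max(remaining, key=lambda a: q(state, a))
        r = reward(state, action)
        remaining.remove(action)
        best_next = max((q(action, a) for a in remaining), default=0.0)
        Q[(state, action)] = q(state, action) + ALPHA * (r + best_next - q(state, action))
        state = action

# Greedy rollout: the learned editing order.
remaining, state, timeline = list(range(N_SHOTS)), None, []
while remaining:
    state = max(remaining, key=lambda a: q(state, a))
    remaining.remove(state)
    timeline.append(state)
print("learned edit order:", timeline)
print("order by feature:  ", sorted(range(N_SHOTS), key=lambda i: features[i]))
```

After training, the greedy rollout tends to recover the ascending-feature order, the toy's stand-in for whatever editing criterion a real learned reward would encode.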