EchoReel: Enhancing Action Generation of Existing Video Diffusion Models
- URL: http://arxiv.org/abs/2403.11535v1
- Date: Mon, 18 Mar 2024 07:41:19 GMT
- Title: EchoReel: Enhancing Action Generation of Existing Video Diffusion Models
- Authors: Jianzhi Liu, Junchen Zhu, Lianli Gao, Jingkuan Song
- Abstract summary: EchoReel is a novel approach that augments the capability of VDMs to generate intricate actions by emulating motions from pre-existing videos.
The Action Prism distills motion information from reference videos and requires training on only a small dataset.
- Score: 88.46315262023045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large-scale video datasets have facilitated the generation of diverse open-domain videos by Video Diffusion Models (VDMs). Nonetheless, the efficacy of VDMs in assimilating complex knowledge from these datasets remains constrained by their inherent scale, leading to suboptimal comprehension and synthesis of numerous actions. In this paper, we introduce EchoReel, a novel approach that augments the capability of VDMs to generate intricate actions by emulating motions from pre-existing videos, which are readily accessible from databases or online repositories. EchoReel integrates seamlessly with existing VDMs, enhancing their ability to produce realistic motions without compromising their fundamental capabilities. Specifically, the Action Prism (AP) is introduced to distill motion information from reference videos; it requires training on only a small dataset. Leveraging the knowledge of pre-trained VDMs, EchoReel injects new action features through additional layers, eliminating the need for further fine-tuning on previously unseen actions. Extensive experiments demonstrate that EchoReel does not merely replicate content from the references; it significantly improves the generation of realistic actions, even in situations where existing VDMs would fail outright.
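The abstract specifies the design only at a high level: an Action Prism distills motion features from reference videos, and extra layers inject those features into an otherwise frozen VDM. Below is a minimal PyTorch sketch of one way that wiring could look; the class names, dimensions, and the use of cross-attention for injection are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ActionPrism(nn.Module):
    """Distills motion tokens from per-frame reference features (hypothetical layout)."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ref_feats):           # (B, T_ref, feat_dim)
        return self.encoder(ref_feats)      # (B, T_ref, feat_dim) motion tokens

class ActionInjection(nn.Module):
    """Extra cross-attention layer feeding motion tokens into a frozen VDM block."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, 8, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, vdm_hidden, motion_tokens):
        # Residual injection: the frozen VDM path is preserved; the new layer
        # only adds an attended correction on top of it.
        attended, _ = self.attn(self.norm(vdm_hidden), motion_tokens, motion_tokens)
        return vdm_hidden + attended

prism, inject = ActionPrism(), ActionInjection()
ref = torch.randn(2, 16, 512)               # features of a reference clip
hidden = torch.randn(2, 64, 512)            # hidden states from some VDM layer
out = inject(hidden, prism(ref))
print(out.shape)                             # torch.Size([2, 64, 512])
```

Only the two small modules would be trained here, consistent with the abstract's claim that the AP trains on a small dataset and the underlying VDM needs no further fine-tuning.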
Related papers
- Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition [24.217068565936117]
We present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video.
To model the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics.
Experiments show our method can achieve state-of-the-art performance on multiple public datasets.
arXiv Detail & Related papers (2024-07-09T07:53:16Z)
- Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality [26.55645677311152]
Video paragraph captioning (VPC) involves generating detailed narratives for long videos.
Existing models are constrained by the assumption of constant availability of a single auxiliary modality.
We propose a Missing-Resistant framework that harnesses all available auxiliary inputs and maintains resilience even in the absence of certain modalities.
arXiv Detail & Related papers (2024-03-28T08:35:46Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
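The CREMA summary above mentions a lightweight fusion module serving many modalities. One plausible reading is a shared cross-attention block queried by learnable per-modality tokens, so that adding a modality adds only a small set of parameters; the sketch below follows that reading, and every name and size in it is an assumption rather than CREMA's published design.

```python
import torch
import torch.nn as nn

class LightweightFusion(nn.Module):
    """One shared attention block queried by learnable per-modality tokens
    (an illustrative sketch, not CREMA's actual module)."""
    def __init__(self, dim=256, n_query=8):
        super().__init__()
        self.queries = nn.ParameterDict()        # one query set per registered modality
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.dim, self.n_query = dim, n_query

    def register_modality(self, name):
        self.queries[name] = nn.Parameter(torch.randn(1, self.n_query, self.dim) * 0.02)

    def forward(self, feats_by_modality):        # dict: name -> (B, T, dim)
        fused = []
        for name, feats in feats_by_modality.items():
            q = self.queries[name].expand(feats.size(0), -1, -1)
            out, _ = self.attn(q, feats, feats)  # compress each modality into n_query tokens
            fused.append(out)
        return torch.cat(fused, dim=1)           # (B, n_query * n_modalities, dim)

fusion = LightweightFusion()
for m in ("video", "audio", "depth"):
    fusion.register_modality(m)
tokens = fusion({m: torch.randn(2, 32, 256) for m in ("video", "audio", "depth")})
print(tokens.shape)                               # torch.Size([2, 24, 256])
```

Training the modalities one at a time, as the summary's "modality-sequential training strategy" suggests, would amount to updating only the newly registered query set while the shared attention block stays fixed.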
- Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent.
Our method mitigates the performance loss, reducing the drop from roughly 30% to only about 10% when half of the test set is modality-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z)
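The Missing Modality Token idea above is concrete enough to sketch: when a modality's features are absent at test time, a learned token sequence stands in for them, so the downstream fusion network always sees a complete input. The following is a minimal illustration under that reading; the token shape and naming are assumptions.

```python
import torch
import torch.nn as nn

class MissingModalityTokens(nn.Module):
    """Substitutes a learned token sequence when a modality's features are absent.
    A sketch of the idea only; the paper's exact design may differ."""
    def __init__(self, modalities=("rgb", "audio"), dim=256, n_tokens=4):
        super().__init__()
        self.mmt = nn.ParameterDict({
            m: nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02) for m in modalities
        })

    def forward(self, feats, batch_size):
        # feats: dict mapping modality name -> (B, T, dim) tensor, or None if missing
        out = {}
        for m, token in self.mmt.items():
            x = feats.get(m)
            out[m] = x if x is not None else token.expand(batch_size, -1, -1)
        return out

mmt = MissingModalityTokens()
feats = {"rgb": torch.randn(2, 16, 256), "audio": None}   # audio stream dropped
completed = mmt(feats, batch_size=2)
print(completed["audio"].shape)                           # torch.Size([2, 4, 256])
```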
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
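TrackDiffusion's summary names two ingredients: tracklet conditioning and an instance enhancer for inter-frame consistency. One natural realization, sketched below, embeds per-frame boxes with instance identities and then attends along the time axis within each track; all module names and shapes here are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TrackletEncoder(nn.Module):
    """Embeds per-frame boxes (x1, y1, x2, y2) plus an instance id into condition tokens.
    A sketch of tracklet conditioning, not TrackDiffusion's actual design."""
    def __init__(self, dim=256, max_instances=16):
        super().__init__()
        self.box_proj = nn.Linear(4, dim)
        self.id_embed = nn.Embedding(max_instances, dim)

    def forward(self, boxes, ids):
        # boxes: (B, T, N, 4) normalized coords; ids: (B, T, N) instance indices
        return self.box_proj(boxes) + self.id_embed(ids)   # (B, T, N, dim)

class InstanceEnhancer(nn.Module):
    """Attends across frames within each instance track to encourage temporal consistency."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, tokens):                              # (B, T, N, dim)
        B, T, N, D = tokens.shape
        # Treat each instance's T frames as one sequence: (B*N, T, dim).
        x = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, N, T, D).permute(0, 2, 1, 3)

enc, enh = TrackletEncoder(), InstanceEnhancer()
boxes, ids = torch.rand(2, 8, 3, 4), torch.randint(0, 16, (2, 8, 3))
cond = enh(enc(boxes, ids))                                 # tokens fed to the diffusion model
print(cond.shape)                                           # torch.Size([2, 8, 3, 256])
```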
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs [112.39389727164594]
Text-to-video (T2V) synthesis has gained increasing attention in the community, where recently emerged diffusion models (DMs) have shown stronger performance than past approaches.
While existing state-of-the-art DMs can achieve high-resolution video generation, they may suffer from key limitations (e.g., disordered action occurrence, crude video motion) in temporal dynamics modeling, a crux of video synthesis.
In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation.
arXiv Detail & Related papers (2023-08-26T08:31:48Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
- Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey [32.526118672614345]
Video analysis tasks such as action recognition have received increasing research interest with growing applications in fields such as smart healthcare.
Video models trained on existing datasets suffer significant performance degradation when deployed directly in real-world applications.
Video unsupervised domain adaptation (VUDA) is introduced to adapt video models from the labeled source domain to the unlabeled target domain.
arXiv Detail & Related papers (2022-11-17T05:05:42Z)