AICL: Action In-Context Learning for Video Diffusion Model
- URL: http://arxiv.org/abs/2403.11535v2
- Date: Fri, 23 Aug 2024 07:02:50 GMT
- Title: AICL: Action In-Context Learning for Video Diffusion Model
- Authors: Jianzhi Liu, Junchen Zhu, Lianli Gao, Heng Tao Shen, Jingkuan Song
- Abstract summary: We propose AICL, which empowers the generative model with the ability to understand action information in reference videos.
Extensive experiments demonstrate that AICL effectively captures the action and achieves state-of-the-art generation performance.
- Score: 124.39948693332552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The open-domain video generation models are constrained by the scale of the training video datasets, and some less common actions still cannot be generated. Some researchers explore video editing methods and achieve action generation by editing the spatial information of the same action video. However, this method mechanically generates identical actions without understanding, which does not align with the characteristics of open-domain scenarios. In this paper, we propose AICL, which empowers the generative model with the ability to understand action information in reference videos, similar to how humans do, through in-context learning. Extensive experiments demonstrate that AICL effectively captures the action and achieves state-of-the-art generation performance across three typical video diffusion models on five metrics when using randomly selected categories from non-training datasets.
Related papers
- Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation [26.292052071093945]
We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than prior state-of-the-art video generation work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)
- Multi-Modal Unsupervised Pre-Training for Surgical Operating Room Workflow Analysis [4.866110274299399]
We propose a novel way to fuse the multi-modal data for a single video frame or image.
We treat the multi-modal data as different views to train the model in an unsupervised manner via clustering.
Results show the superior performance of our approach on surgical video activity recognition and semantic segmentation.
arXiv Detail & Related papers (2022-07-16T10:32:27Z)
- Self-Supervised Learning via multi-Transformation Classification for Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on the multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations.
We conducted experiments on the UCF101 and HMDB51 datasets, using C3D and 3D ResNet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z)
- Exploring Relations in Untrimmed Videos for Self-Supervised Learning [17.670226952829506]
Existing self-supervised learning methods mainly rely on trimmed videos for model training.
We propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV).
ERUV is able to learn richer representations and it outperforms state-of-the-art self-supervised methods with significant margins.
arXiv Detail & Related papers (2020-08-06T15:29:25Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a method for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)