EchoReel: Enhancing Action Generation of Existing Video Diffusion Models
- URL: http://arxiv.org/abs/2403.11535v1
- Date: Mon, 18 Mar 2024 07:41:19 GMT
- Title: EchoReel: Enhancing Action Generation of Existing Video Diffusion Models
- Authors: Jianzhi Liu, Junchen Zhu, Lianli Gao, Jingkuan Song
- Abstract summary: EchoReel is a novel approach that augments the capability of VDMs to generate intricate actions by emulating motions from pre-existing videos.
The Action Prism distills motion information from reference videos and requires training on only a small dataset.
- Score: 88.46315262023045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large-scale video datasets have facilitated the generation of diverse open-domain videos by Video Diffusion Models (VDMs). Nonetheless, the efficacy of VDMs in assimilating complex knowledge from these datasets remains constrained by their inherent scale, leading to suboptimal comprehension and synthesis of numerous actions. In this paper, we introduce EchoReel, a novel approach that augments the capability of VDMs to generate intricate actions by emulating motions from pre-existing videos, which are readily accessible from databases or online repositories. EchoReel integrates seamlessly with existing VDMs, enhancing their ability to produce realistic motions without compromising their fundamental capabilities. Specifically, the Action Prism (AP) is introduced to distill motion information from reference videos; it requires training on only a small dataset. Leveraging the knowledge of pre-trained VDMs, EchoReel injects new action features through additional layers, eliminating the need for further fine-tuning on previously unseen actions. Extensive experiments demonstrate that EchoReel does not merely replicate content from the references; it significantly improves the generation of realistic actions, even in situations where existing VDMs would fail outright.
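The abstract specifies the design only at a high level: an Action Prism distills motion features from reference videos, and extra layers inject those features into an otherwise frozen VDM. Below is a minimal PyTorch sketch of one way that wiring could look; the class names, dimensions, and the use of cross-attention for injection are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ActionPrism(nn.Module):
    """Distills motion tokens from per-frame reference features (hypothetical layout)."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ref_feats):           # (B, T_ref, feat_dim)
        return self.encoder(ref_feats)      # (B, T_ref, feat_dim) motion tokens

class ActionInjection(nn.Module):
    """Extra cross-attention layer feeding motion tokens into a frozen VDM block."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, 8, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, vdm_hidden, motion_tokens):
        # Residual injection: the frozen VDM path is preserved; the new layer
        # only adds an attended correction on top of it.
        attended, _ = self.attn(self.norm(vdm_hidden), motion_tokens, motion_tokens)
        return vdm_hidden + attended

prism, inject = ActionPrism(), ActionInjection()
ref = torch.randn(2, 16, 512)               # features of a reference clip
hidden = torch.randn(2, 64, 512)            # hidden states from some VDM layer
out = inject(hidden, prism(ref))
print(out.shape)                             # torch.Size([2, 64, 512])
```

Only the two small modules would be trained here, consistent with the abstract's claim that the AP trains on a small dataset and the underlying VDM needs no further fine-tuning.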
Related papers
- Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition [24.217068565936117]
We present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video.
To model the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics.
Experiments show our method can achieve state-of-the-art performance on multiple public datasets.
arXiv Detail & Related papers (2024-07-09T07:53:16Z)
- Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality [26.55645677311152]
Video paragraph captioning (VPC) involves generating detailed narratives for long videos.
Existing models are constrained by the assumption of constant availability of a single auxiliary modality.
We propose a Missing-Resistant framework that harnesses all available auxiliary inputs and maintains resilience even in the absence of certain modalities.
arXiv Detail & Related papers (2024-03-28T08:35:46Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
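The CREMA summary above mentions a lightweight fusion module serving many modalities. One plausible reading is a shared cross-attention block queried by learnable per-modality tokens, so that adding a modality adds only a small set of parameters; the sketch below follows that reading, and every name and size in it is an assumption rather than CREMA's published design.

```python
import torch
import torch.nn as nn

class LightweightFusion(nn.Module):
    """One shared attention block queried by learnable per-modality tokens
    (an illustrative sketch, not CREMA's actual module)."""
    def __init__(self, dim=256, n_query=8):
        super().__init__()
        self.queries = nn.ParameterDict()        # one query set per registered modality
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.dim, self.n_query = dim, n_query

    def register_modality(self, name):
        self.queries[name] = nn.Parameter(torch.randn(1, self.n_query, self.dim) * 0.02)

    def forward(self, feats_by_modality):        # dict: name -> (B, T, dim)
        fused = []
        for name, feats in feats_by_modality.items():
            q = self.queries[name].expand(feats.size(0), -1, -1)
            out, _ = self.attn(q, feats, feats)  # compress each modality into n_query tokens
            fused.append(out)
        return torch.cat(fused, dim=1)           # (B, n_query * n_modalities, dim)

fusion = LightweightFusion()
for m in ("video", "audio", "depth"):
    fusion.register_modality(m)
tokens = fusion({m: torch.randn(2, 32, 256) for m in ("video", "audio", "depth")})
print(tokens.shape)                               # torch.Size([2, 24, 256])
```

Training the modalities one at a time, as the summary's "modality-sequential training strategy" suggests, would amount to updating only the newly registered query set while the shared attention block stays fixed.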
- Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent.
Our method mitigates the performance loss, reducing the drop from roughly 30% to only about 10% when half of the test set is modality-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z)
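The Missing Modality Token idea above is concrete enough to sketch: when a modality's features are absent at test time, a learned token sequence stands in for them, so the downstream fusion network always sees a complete input. The following is a minimal illustration under that reading; the token shape and naming are assumptions.

```python
import torch
import torch.nn as nn

class MissingModalityTokens(nn.Module):
    """Substitutes a learned token sequence when a modality's features are absent.
    A sketch of the idea only; the paper's exact design may differ."""
    def __init__(self, modalities=("rgb", "audio"), dim=256, n_tokens=4):
        super().__init__()
        self.mmt = nn.ParameterDict({
            m: nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02) for m in modalities
        })

    def forward(self, feats, batch_size):
        # feats: dict mapping modality name -> (B, T, dim) tensor, or None if missing
        out = {}
        for m, token in self.mmt.items():
            x = feats.get(m)
            out[m] = x if x is not None else token.expand(batch_size, -1, -1)
        return out

mmt = MissingModalityTokens()
feats = {"rgb": torch.randn(2, 16, 256), "audio": None}   # audio stream dropped
completed = mmt(feats, batch_size=2)
print(completed["audio"].shape)                           # torch.Size([2, 4, 256])
```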
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
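TrackDiffusion's summary names two ingredients: tracklet conditioning and an instance enhancer for inter-frame consistency. One natural realization, sketched below, embeds per-frame boxes with instance identities and then attends along the time axis within each track; all module names and shapes here are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TrackletEncoder(nn.Module):
    """Embeds per-frame boxes (x1, y1, x2, y2) plus an instance id into condition tokens.
    A sketch of tracklet conditioning, not TrackDiffusion's actual design."""
    def __init__(self, dim=256, max_instances=16):
        super().__init__()
        self.box_proj = nn.Linear(4, dim)
        self.id_embed = nn.Embedding(max_instances, dim)

    def forward(self, boxes, ids):
        # boxes: (B, T, N, 4) normalized coords; ids: (B, T, N) instance indices
        return self.box_proj(boxes) + self.id_embed(ids)   # (B, T, N, dim)

class InstanceEnhancer(nn.Module):
    """Attends across frames within each instance track to encourage temporal consistency."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, tokens):                              # (B, T, N, dim)
        B, T, N, D = tokens.shape
        # Treat each instance's T frames as one sequence: (B*N, T, dim).
        x = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, N, T, D).permute(0, 2, 1, 3)

enc, enh = TrackletEncoder(), InstanceEnhancer()
boxes, ids = torch.rand(2, 8, 3, 4), torch.randint(0, 16, (2, 8, 3))
cond = enh(enc(boxes, ids))                                 # tokens fed to the diffusion model
print(cond.shape)                                           # torch.Size([2, 8, 3, 256])
```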
- Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs [112.39389727164594]
Text-to-video (T2V) synthesis has gained increasing attention in the community, where recently emerged diffusion models (DMs) have shown stronger performance than past approaches.
While existing state-of-the-art DMs can achieve high-resolution video generation, they may suffer from key limitations (e.g., disordered action occurrence, crude video motion) in temporal dynamics modeling, a crux of video synthesis.
In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation.
arXiv Detail & Related papers (2023-08-26T08:31:48Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
- Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey [32.526118672614345]
Video analysis tasks such as action recognition have received increasing research interest with growing applications in fields such as smart healthcare.
Video models trained on existing datasets suffer significant performance degradation when deployed directly in real-world applications.
Video unsupervised domain adaptation (VUDA) is introduced to adapt video models from the labeled source domain to the unlabeled target domain.
arXiv Detail & Related papers (2022-11-17T05:05:42Z)