Video-ReTime: Learning Temporally Varying Speediness for Time Remapping
- URL: http://arxiv.org/abs/2205.05609v1
- Date: Wed, 11 May 2022 16:27:47 GMT
- Title: Video-ReTime: Learning Temporally Varying Speediness for Time Remapping
- Authors: Simon Jenni, Markus Woodson, Fabian Caba Heilbron
- Abstract summary: We train a neural network through self-supervision to recognize and accurately localize changes in the video playback speed.
We demonstrate that this model can detect playback speed variations more accurately while also being orders of magnitude more efficient than prior approaches.
- Score: 12.139222986297263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method for generating a temporally remapped video that matches
the desired target duration while maximally preserving natural video dynamics.
Our approach trains a neural network through self-supervision to recognize and
accurately localize temporally varying changes in the video playback speed. To
re-time videos, we 1. use the model to infer the slowness of individual video
frames, and 2. optimize the temporal frame sub-sampling to be consistent with
the model's slowness predictions. We demonstrate that this model can detect
playback speed variations more accurately while also being orders of magnitude
more efficient than prior approaches. Furthermore, we propose an optimization
for video re-timing that enables precise control over the target duration and
performs more robustly on longer videos than prior methods. We evaluate the
model quantitatively on artificially sped-up videos, through transfer to
action recognition, and qualitatively through user studies.
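To make the re-timing step concrete, the core idea can be sketched as follows: treat the model's per-frame slowness score as a budget for local speed-up, scale the local playback speed so the output hits the target duration exactly, and sub-sample source frames accordingly. The snippet below is a minimal NumPy sketch of this idea; the `slowness` array, the proportional speed rule, and the nearest-neighbour frame selection are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def retime_by_slowness(slowness, target_len):
    """Pick `target_len` source-frame indices from a video of len(slowness) frames.

    `slowness` is a hypothetical per-frame score (> 0): higher values mean the
    content looks slow and can tolerate a larger local speed-up. This is a
    simplified stand-in for the paper's sub-sampling optimization.
    """
    s = np.asarray(slowness, dtype=np.float64)
    n = len(s)

    # Local playback speed is set proportional to slowness; the constant k is
    # chosen so that the total output duration equals exactly target_len frames.
    k = np.sum(1.0 / s) / target_len
    dt = 1.0 / (k * s)          # output time spent on each source frame
    t = np.cumsum(dt)           # cumulative output time after each source frame

    # Output frame m shows the source frame whose cumulative output time first
    # reaches m + 0.5 (a nearest-neighbour time remapping).
    out_times = np.arange(target_len) + 0.5
    idx = np.searchsorted(t, out_times)
    return np.clip(idx, 0, n - 1)

# Toy example: a slow middle segment is sampled sparsely (sped up the most),
# while the faster beginning and end keep most of their frames.
slowness = np.concatenate([np.full(30, 0.2), np.full(40, 0.9), np.full(30, 0.2)])
print(retime_by_slowness(slowness, target_len=50))
```

Segments the model judges to be slow are sampled sparsely and hence sped up, while fast segments keep a near-native frame rate under a hard duration constraint, which mirrors the behaviour described in the abstract.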
Related papers
- Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search [23.3627657867351]
The alignment problem, in which the output of diffusion models is steered according to some measure of content quality, has attracted considerable attention.
We propose diffusion latent beam search with a lookahead estimator, which can select better diffusion latents to maximize a given alignment reward.
We demonstrate that our method improves perceptual quality based on the calibrated reward, without any model parameter updates.
arXiv Detail & Related papers (2025-01-31T16:09:30Z) - Temporal Preference Optimization for Long-Form Video Understanding [28.623353303256653]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs.
TPO significantly enhances temporal understanding while reducing reliance on manually annotated data.
LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z) - Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM [54.2320450886902]
Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs.
Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware.
We introduce Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model.
arXiv Detail & Related papers (2024-12-19T18:32:21Z) - VEnhancer: Generative Space-Time Enhancement for Video Generation [123.37212575364327]
VEnhancer improves existing text-to-video results by adding more detail in the spatial domain and synthetic detailed motion in the temporal domain.
We train a video ControlNet and inject it into the diffusion model as a condition on low-frame-rate and low-resolution videos.
VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos.
arXiv Detail & Related papers (2024-07-10T13:46:08Z) - UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports randomly noised input as well as first-frame-conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z) - View while Moving: Efficient Video Recognition in Long-untrimmed Videos [17.560160747282147]
We propose a novel recognition paradigm "View while Moving" for efficient long-untrimmed video recognition.
In contrast to the two-stage paradigm, our paradigm only needs to access the raw frames once.
Our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency.
arXiv Detail & Related papers (2023-08-09T09:46:26Z) - Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces low-confidence outputs for randomly shuffled frames.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution
Video Prediction [78.129039340528]
We propose a Spatiotemporal Residual Predictive Model (STRPM) for high-resolution video prediction.
Experimental results show that STRPM can generate more satisfactory results compared with various existing methods.
arXiv Detail & Related papers (2022-03-30T06:24:00Z) - Self-Supervised Visual Learning by Variable Playback Speeds Prediction
of a Video [23.478555947694108]
We propose a self-supervised visual learning method by predicting the variable playback speeds of a video.
We learn the meta-temporal visual variations in the video by leveraging the variations in the visual appearance according to playback speeds.
We also propose a new layer-dependable temporal group normalization method that can be applied to 3D convolutional networks.
arXiv Detail & Related papers (2020-03-05T15:01:08Z)
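The last entry above describes the kind of playback-speed pretext task that also underlies Video-ReTime's self-supervision: sample a clip with a randomly chosen frame stride and train a network to predict which speed was used, so labels come for free. The PyTorch sketch below illustrates this setup under assumed choices; the candidate speeds, the dummy video, and the tiny 3D-CNN classifier are illustrative, not taken from either paper.

```python
import torch
import torch.nn as nn

SPEEDS = [1, 2, 4, 8]                        # candidate playback speeds (assumed)

def sample_speed_clip(video, clip_len=16):
    """Sample a clip from `video` (T, C, H, W) at a random playback speed.

    Returns the clip and the index of the speed class to be predicted.
    """
    label = torch.randint(len(SPEEDS), ())
    stride = SPEEDS[label]
    max_start = video.shape[0] - clip_len * stride
    start = torch.randint(max(max_start, 1), ())
    idx = start + stride * torch.arange(clip_len)
    return video[idx], label

class SpeedClassifier(nn.Module):
    """Tiny stand-in for a 3D-CNN backbone plus a speed-classification head."""
    def __init__(self, num_speeds=len(SPEEDS)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, num_speeds)

    def forward(self, clip):                 # clip: (B, C, T, H, W)
        return self.head(self.backbone(clip))

# Self-supervised training step: the label is produced by the sampling
# procedure itself, so no manual annotation is required.
video = torch.rand(200, 3, 64, 64)           # dummy video (T, C, H, W)
clip, label = sample_speed_clip(video)
clip = clip.permute(1, 0, 2, 3).unsqueeze(0) # -> (1, C, T, H, W)
model = SpeedClassifier()
loss = nn.functional.cross_entropy(model(clip), label.unsqueeze(0))
loss.backward()
```

Because the supervisory signal is generated by the sampling procedure itself, such models can be trained on unlabeled video, which is what enables the transfer-to-action-recognition evaluation reported for Video-ReTime.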