Video-GPT via Next Clip Diffusion
- URL: http://arxiv.org/abs/2505.12489v2
- Date: Wed, 21 May 2025 04:44:19 GMT
- Title: Video-GPT via Next Clip Diffusion
- Authors: Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang, et al.
- Abstract summary: GPT has shown remarkable success in natural language processing. We treat video as a new language for visual world modeling. We introduce a novel next clip diffusion paradigm for pretraining Video-GPT.
- Score: 14.832916520268105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: GPT has shown remarkable success in natural language processing. However, a language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, a video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, a key factor in world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks spanning video generation and understanding, demonstrating strong generalization to downstream applications. The project page is at https://zhuangshaobin.github.io/Video-GPT.github.io/.
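To make the paradigm concrete, here is a minimal sketch of a next clip diffusion training step as the abstract describes it: noise the next clip, then predict that noise conditioned on the clean history clips. The ClipDenoiser module, the tensor shapes, and the linear noise schedule are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of next clip diffusion (assumptions, not the paper's code).
import torch
import torch.nn as nn

class ClipDenoiser(nn.Module):
    """Hypothetical stand-in for the Video-GPT backbone: a transformer that
    reads [clean history clips, noisy next clip] and predicts the noise."""
    def __init__(self, clip_dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=clip_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(clip_dim, clip_dim)

    def forward(self, history: torch.Tensor, noisy_clip: torch.Tensor) -> torch.Tensor:
        # history: (B, T, D) clean clip embeddings; noisy_clip: (B, 1, D)
        seq = torch.cat([history, noisy_clip], dim=1)   # past clips, then noisy target
        return self.head(self.encoder(seq)[:, -1:, :])  # noise estimate for the target

def training_step(model: nn.Module, history: torch.Tensor, next_clip: torch.Tensor):
    """One denoising step: corrupt the next clip, condition on the clean history."""
    t = torch.rand(next_clip.shape[0], 1, 1)   # per-sample noise level in (0, 1)
    eps = torch.randn_like(next_clip)
    noisy = (1 - t) * next_clip + t * eps      # linear schedule (an assumption)
    return nn.functional.mse_loss(model(history, noisy), eps)

model = ClipDenoiser()
loss = training_step(model, torch.randn(2, 7, 256), torch.randn(2, 1, 256))
print(f"toy loss: {loss.item():.4f}")
```

At inference, the model would iteratively denoise a pure-noise clip and append the result to the history before generating the next one, which is how the abstract links short-term generation to long-term prediction.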
Related papers
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions [93.29360532845062]
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions.
The series comprises ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through a carefully designed data filtering and annotation strategy.
We further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos.
arXiv Detail & Related papers (2024-06-06T17:58:54Z)
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and produce rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT [1.614471032380076]
Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query.
Most existing VTG models are trained on extensive sets of annotated video-text pairs.
We propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning.
arXiv Detail & Related papers (2024-03-04T14:22:02Z)
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- WAIT: Feature Warping for Animation to Illustration Video Translation using GANs [11.968412857420192]
We introduce a new problem for video stylizing in which an unordered set of images is used.
Most video-to-video translation methods are built on an image-to-image translation model.
We propose a new generator network with feature-warping layers that overcomes the limitations of previous methods.
arXiv Detail & Related papers (2023-10-07T19:45:24Z)
- Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation [82.26026492545533]
We introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction.
We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task.
At inference time, a single MMVG model can address all three TVC cases, namely video prediction, rewind, and infilling, by applying the corresponding masking conditions (see the sketch after this entry).
arXiv Detail & Related papers (2022-11-23T10:14:12Z)
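To illustrate the masking conditions mentioned in the entry above, here is a small sketch of how the three TVC cases could be expressed as frame masks. The quarter-length context windows and the mask convention (1 = generate, 0 = given) are assumptions for illustration, not MMVG's actual implementation.

```python
# Sketch of the three TVC masking conditions (illustrative, not MMVG's code).
import numpy as np

def tvc_mask(num_frames: int, case: str) -> np.ndarray:
    """1 marks frames the model must generate, 0 marks frames given as context."""
    mask = np.ones(num_frames, dtype=np.int64)
    context = max(1, num_frames // 4)     # assumed context size
    if case == "prediction":              # first frames given, generate the future
        mask[:context] = 0
    elif case == "rewind":                # last frames given, generate the past
        mask[-context:] = 0
    elif case == "infilling":             # both ends given, generate the middle
        mask[:context] = 0
        mask[-context:] = 0
    else:
        raise ValueError(f"unknown TVC case: {case}")
    return mask

for case in ("prediction", "rewind", "infilling"):
    print(case, tvc_mask(8, case))
```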
- Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose ClipBERT, a generic framework that enables affordable end-to-end learning for video-and-language tasks via sparse clip sampling (see the sketch after this entry).
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
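The sparse-sampling idea behind ClipBERT is easy to sketch: rather than learning from dense offline-extracted features over every frame, draw only a few short clips per video at each training step and backpropagate end to end. The clip count and length below are illustrative defaults, not ClipBERT's settings.

```python
# Sketch of sparse clip sampling (illustrative defaults, not ClipBERT's settings).
import random

def sample_sparse_clips(num_frames: int, num_clips: int = 2, clip_len: int = 4):
    """Return frame-index ranges for a few randomly placed short clips."""
    clips = []
    for _ in range(num_clips):
        start = random.randrange(0, num_frames - clip_len + 1)
        clips.append(list(range(start, start + clip_len)))
    return clips

# e.g. two 4-frame clips from a 300-frame video, resampled every training step
print(sample_sparse_clips(300))
```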
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.