Related papers: Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

URL: http://arxiv.org/abs/2412.04432v1
Date: Thu, 05 Dec 2024 18:53:04 GMT
Title: Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Authors: Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan,
Abstract summary: Divot is a Diffusion-Powered Video Tokenizer.<n>We present Divot-unaVic through video-to-text autoregression and text-to-video generation.
Score: 54.21476271127356
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.

Related papers

Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models [10.585096070697348]
We introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions.<n>VideoNarrator addresses challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models can function as caption generators.<n>Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations.
arXiv Detail & Related papers (2025-07-22T22:16:37Z)
Unsupervised Transcript-assisted Video Summarization and Highlight Detection [6.80224810039938]
We propose a multimodal pipeline that leverages video frames and their corresponding transcripts to generate a more condensed version of the video.<n>The pipeline is trained within an RL framework, which rewards the model for generating diverse and representative summaries.<n>Our experiments show that using the transcript in video summarization and highlight detection achieves superior results compared to relying solely on the visual content of the video.
arXiv Detail & Related papers (2025-05-29T09:16:19Z)
Video Creation by Demonstration [59.389591010842636]
We present $delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process. Empirically, $delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z)
Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance. During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z)
SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models. Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks. Our core design is a progressive training approach that unifies the masked video modeling, cross contrastive learning, and prediction token, scaling up to 6B video size.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video. In this paper, we address such limitations in video pre-training with an efficient video decomposition. Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
VideoStudio: Generating Consistent-Content and Multi-Scene Videos [88.88118783892779]
VideoStudio is a framework for consistent-content and multi-scene video generation. VideoStudio leverages Large Language Models (LLM) to convert the input prompt into comprehensive multi-scene script. VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.
arXiv Detail & Related papers (2024-01-02T15:56:48Z)
MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user. Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process. Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.