Vlogger: Make Your Dream A Vlog
- URL: http://arxiv.org/abs/2401.09414v1
- Date: Wed, 17 Jan 2024 18:55:12 GMT
- Title: Vlogger: Make Your Dream A Vlog
- Authors: Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang
- Abstract summary: Vlogger is a generic AI system for generating a minute-level video blog (i.e., vlog) of user descriptions.
We invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer.
Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor.
- Score: 67.50445251570173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present Vlogger, a generic AI system for generating a
minute-level video blog (i.e., vlog) of user descriptions. Different from short
videos with a few seconds, vlog often contains a complex storyline with
diversified scenes, which is challenging for most existing video generation
approaches. To break through this bottleneck, our Vlogger smartly leverages
Large Language Model (LLM) as Director and decomposes a long video generation
task of vlog into four key stages, where we invoke various foundation models to
play the critical roles of vlog professionals, including (1) Script, (2) Actor,
(3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings,
our Vlogger can generate vlogs through explainable cooperation of top-down
planning and bottom-up shooting. Moreover, we introduce a novel video diffusion
model, ShowMaker, which serves as a videographer in our Vlogger for generating
the video snippet of each shooting scene. By incorporating Script and Actor
attentively as textual and visual prompts, it can effectively enhance
spatial-temporal coherence in the snippet. Besides, we design a concise mixed
training paradigm for ShowMaker, boosting its capacity for both T2V generation
and prediction. Finally, the extensive experiments show that our method
achieves state-of-the-art performance on zero-shot T2V generation and
prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs
from open-world descriptions, without loss of video coherence on script and
actor. The code and model are all available at
https://github.com/zhuangshaobin/Vlogger.
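
The four-stage decomposition described in the abstract can be pictured with a minimal Python sketch. This is not the authors' code: the names `Scene`, `Vlog`, `make_vlog`, and all `stub_*` functions are hypothetical, and the stubs return placeholder strings where the real system would call an LLM, an image model, the ShowMaker video diffusion model, and a text-to-speech model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scene:
    description: str   # script text for one shooting scene
    duration_s: float  # planned snippet length in seconds (assumed field)


@dataclass
class Vlog:
    snippets: List[str] = field(default_factory=list)   # placeholders for video clips
    narration: List[str] = field(default_factory=list)  # placeholders for audio tracks


def make_vlog(
    story: str,
    plan_script: Callable[[str], List[Scene]],
    design_actors: Callable[[str, List[Scene]], Dict[str, str]],
    shoot: Callable[[Scene, Dict[str, str]], str],
    voice: Callable[[Scene], str],
) -> Vlog:
    """Hypothetical orchestration of the four roles named in the abstract."""
    scenes = plan_script(story)            # (1) Script: top-down planning
    actors = design_actors(story, scenes)  # (2) Actor: visual references
    vlog = Vlog()
    for scene in scenes:                   # bottom-up shooting, scene by scene
        vlog.snippets.append(shoot(scene, actors))  # (3) ShowMaker
        vlog.narration.append(voice(scene))         # (4) Voicer
    return vlog


# Stub stages so the sketch runs without any model; all are assumptions.
def stub_plan_script(story: str) -> List[Scene]:
    return [Scene(s.strip(), 2.0) for s in story.split(".") if s.strip()]


def stub_design_actors(story: str, scenes: List[Scene]) -> Dict[str, str]:
    return {"protagonist": "reference-image-placeholder"}


def stub_shoot(scene: Scene, actors: Dict[str, str]) -> str:
    return f"clip[{scene.description}]"


def stub_voice(scene: Scene) -> str:
    return f"audio[{scene.description}]"


demo = make_vlog(
    "A bear wakes up. The bear goes fishing.",
    stub_plan_script,
    stub_design_actors,
    stub_shoot,
    stub_voice,
)
```

Passing the same `actors` mapping to every `shoot` call mirrors the abstract's idea of using Script and Actor as textual and visual prompts to keep snippets coherent; how the real system conditions on them is not specified here.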
Related papers
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Vriptor is a powerful model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions [93.29360532845062]
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions.
The series comprises ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through a carefully designed data filtering and annotation strategy.
We further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos.
arXiv Detail & Related papers (2024-06-06T17:58:54Z)
- A Vlogger-augmented Graph Neural Network Model for Micro-video Recommendation [7.54949302096348]
We propose a vlogger-augmented graph neural network model VA-GNN, which takes the effect of vloggers into consideration.
Specifically, we construct a tripartite graph with users, micro-videos, and vloggers as nodes, capturing user preferences from different views.
When predicting the next user-video interaction, we adaptively combine the user preferences for a video itself and its vlogger.
arXiv Detail & Related papers (2024-05-28T15:13:29Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments reveal challenges in generating long and comprehensive video summaries.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- Dynamic Storyboard Generation in an Engine-based Virtual Environment for Video Production [92.14891282042764]
We present Virtual Dynamic Storyboard (VDS) to let users storyboard shots in virtual environments.
VDS runs in a "propose-simulate-discriminate" mode: given a formatted story script and a camera script as input, it generates several character animation and camera movement proposals.
To pick the top-quality dynamic storyboard from the candidates, we equip it with a shot ranking discriminator based on shot quality criteria learned from data created manually by professionals.
arXiv Detail & Related papers (2023-01-30T06:37:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.