VC-LLM: Automated Advertisement Video Creation from Raw Footage using Multi-modal LLMs
- URL: http://arxiv.org/abs/2504.05673v1
- Date: Tue, 08 Apr 2025 04:35:23 GMT
- Title: VC-LLM: Automated Advertisement Video Creation from Raw Footage using Multi-modal LLMs
- Authors: Dongjun Qian, Kai Su, Yiming Tan, Qishuai Diao, Xian Wu, Chang Liu, Bingyue Peng, Zehuan Yuan
- Abstract summary: We present VC-LLM, a framework powered by Large Language Models for the automatic creation of high-quality short-form advertisement videos. Our approach leverages high-resolution spatial input and low-resolution temporal input to represent video clips more effectively. Experiments show that VC-LLM based on GPT-4o can produce videos comparable to those created by humans.
- Score: 43.50425781768217
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As short videos have risen in popularity, video content has become increasingly significant in advertising. Typically, advertisers record a large amount of raw footage about a product and then create numerous short-form advertisement videos from it. Producing such videos mainly involves editing the raw footage and writing advertisement scripts, which requires a certain level of creativity; creating many distinct videos for the same product is usually challenging, and doing so manually is inefficient. In this paper, we present VC-LLM, a framework powered by Large Language Models for the automatic creation of high-quality short-form advertisement videos. Our approach leverages high-resolution spatial input and low-resolution temporal input to represent video clips more effectively, capturing both fine-grained visual details and broader temporal dynamics. In addition, during training we incorporate supplementary information generated by rewriting the ground-truth text, ensuring that all key output information can be directly traced back to the input and thereby reducing model hallucinations. We also design a benchmark to evaluate the quality of the created videos. Experiments show that VC-LLM based on GPT-4o can produce videos comparable to those created by humans. Furthermore, we collected numerous high-quality short advertisement videos to build a pre-training dataset and manually cleaned a portion of the data to construct a high-quality fine-tuning dataset. On the benchmark, VC-LLM based on the fine-tuned LLM produces videos with superior narrative logic compared to those created by VC-LLM based on GPT-4o.
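The dual-resolution clip representation described in the abstract can be sketched roughly as below. This is a minimal illustration assuming a generic OpenCV decoding pipeline; the frame counts, resolutions, and the `represent_clip` helper are hypothetical choices for the sketch, not the paper's actual settings.

```python
# Minimal sketch of a dual-resolution clip representation: a few high-resolution
# frames for fine spatial detail plus many low-resolution frames for temporal
# dynamics. Counts and sizes are illustrative assumptions, not the paper's values.
import cv2


def represent_clip(path, n_spatial=2, n_temporal=16,
                   hi_res=(768, 768), lo_res=(224, 224)):
    """Return (spatial_frames, temporal_frames) for one raw-footage clip."""
    cap = cv2.VideoCapture(path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    if not frames:
        raise ValueError(f"no frames decoded from {path}")

    def sample(n):
        # Evenly spaced sampling across the clip.
        step = max(len(frames) // n, 1)
        return frames[::step][:n]

    spatial = [cv2.resize(f, hi_res) for f in sample(n_spatial)]    # sparse, high-res
    temporal = [cv2.resize(f, lo_res) for f in sample(n_temporal)]  # dense, low-res
    return spatial, temporal
```

Both frame sets would then be encoded (for example, as images attached to a GPT-4o prompt, or through a vision encoder for an open model) together with the clip metadata, so the LLM sees fine-grained visual detail and coarse temporal dynamics at a manageable token cost.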
Related papers
- Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation [54.21476271127356]
Divot is a Diffusion-Powered Video Tokenizer.
We present Divot-Vicuna for video-to-text autoregression and text-to-video generation.
arXiv Detail & Related papers (2024-12-05T18:53:04Z) - Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision [24.568643475808564]
Video Self-Training with augmented Reasoning (Video-STaR) is the first video self-training approach.
Video-STaR allows the utilization of any labeled video dataset for video instruction tuning.
arXiv Detail & Related papers (2024-07-08T17:59:42Z) - Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models [29.825619120260164]
This paper addresses the challenge by leveraging the visual commonalities between images and videos to evolve image-LVLMs into video-LVLMs.
We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data.
arXiv Detail & Related papers (2024-06-12T09:22:45Z) - MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies [21.489102981760766]
MovieLLM is a novel framework designed to synthesize consistent and high-quality video data for instruction tuning.
Our experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives.
arXiv Detail & Related papers (2024-03-03T07:43:39Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - VideoStudio: Generating Consistent-Content and Multi-Scene Videos [88.88118783892779]
VideoStudio is a framework for consistent-content and multi-scene video generation.
VideoStudio leverages Large Language Models (LLMs) to convert the input prompt into a comprehensive multi-scene script.
VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference.
arXiv Detail & Related papers (2024-01-02T15:56:48Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (illustrated in the sketch after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Dynamic Storyboard Generation in an Engine-based Virtual Environment for
Video Production [92.14891282042764]
We present Virtual Dynamic Storyboard (VDS), which allows users to storyboard shots in virtual environments.
VDS operates in a "propose-simulate-discriminate" mode: given a formatted story script and a camera script as input, it generates several character animation and camera movement proposals.
To select the top-quality dynamic storyboard from the candidates, we equip it with a shot-ranking discriminator based on shot-quality criteria learned from professionally created data.
arXiv Detail & Related papers (2023-01-30T06:37:35Z)