Loong: Generating Minute-level Long Videos with Autoregressive Language Models
- URL: http://arxiv.org/abs/2410.02757v1
- Date: Thu, 3 Oct 2024 17:59:02 GMT
- Title: Loong: Generating Minute-level Long Videos with Autoregressive Language Models
- Authors: Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu
- Abstract summary: We propose Loong, a new autoregressive large language model (LLM)-based video generator that can generate minute-long videos.
Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch.
Our proposed Loong can be trained on 10-second videos and extended to generate minute-level long videos conditioned on text prompts.
- Score: 76.59124981781602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is desirable but challenging to generate content-rich long videos on the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, whereas the exploration of autoregressive LLMs for video generation has been limited to short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and extended to generate minute-level long videos conditioned on text prompts, as demonstrated by our results. More samples are available at: https://epiphqny.github.io/Loong-video.
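The abstract describes two ideas concretely enough to sketch: text and video tokens are modeled as one autoregressive sequence, and a loss re-weighting scheme rebalances training as clips grow from roughly 10 seconds toward a minute. The snippet below is a minimal, hypothetical PyTorch sketch of such a loss; the function names, the per-frame geometric decay, and the assumption that later-frame tokens are down-weighted are illustrative choices, not the paper's actual formulation.

```python
# Hypothetical sketch of (1) a unified text+video token sequence and
# (2) per-frame loss re-weighting for progressive short-to-long training.
# Shapes, names, and the geometric decay rule are assumptions for illustration.

import torch
import torch.nn.functional as F


def unified_sequence(text_ids: torch.Tensor, video_ids: torch.Tensor) -> torch.Tensor:
    """Concatenate text tokens and discrete video tokens into one sequence of shape (B, L)."""
    return torch.cat([text_ids, video_ids], dim=1)


def reweighted_ar_loss(logits: torch.Tensor,      # (B, L, vocab), already shifted to predict `targets`
                       targets: torch.Tensor,     # (B, L) unified text+video token ids
                       n_text: int,               # number of leading text tokens
                       tokens_per_frame: int,     # discrete tokens per video frame
                       decay: float = 0.95) -> torch.Tensor:
    """Next-token cross-entropy where tokens of later frames receive smaller weights.

    Rationale (hedged): with teacher forcing, later-frame tokens can lean on many
    conditioning frames and become easy to predict, so an unweighted loss over a
    long clip is dominated by them; down-weighting them rebalances the objective.
    """
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1),
                         reduction="none").view_as(targets)     # per-token loss, (B, L)

    weights = torch.ones_like(ce)                               # text tokens keep weight 1.0
    n_video = targets.size(1) - n_text
    frame_idx = torch.arange(n_video, device=ce.device) // tokens_per_frame
    weights[:, n_text:] = decay ** frame_idx                    # later frames contribute less
    return (ce * weights).sum() / weights.sum()


# Toy usage with random tensors (dimensions are arbitrary):
B, n_text, n_frames, tpf, vocab = 2, 8, 4, 16, 1024
targets = torch.randint(vocab, (B, n_text + n_frames * tpf))
logits = torch.randn(B, targets.size(1), vocab)
loss = reweighted_ar_loss(logits, targets, n_text=n_text, tokens_per_frame=tpf)
```

Under the progressive short-to-long schedule described in the abstract, the same objective would presumably first be applied to windows covering about 10 seconds of video tokens and then to progressively longer windows; the video token re-encoding and sampling strategies mentioned for inference are not covered by this sketch.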
Related papers
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [26.72068455284472]
Video-XL is an extra-long vision language model designed for efficient hour-scale video understanding.
Our model achieves promising results on popular long video understanding benchmarks.
arXiv Detail & Related papers (2024-09-22T15:13:31Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- LVCHAT: Facilitating Long Video Comprehension [25.395689904747965]
We propose Long Video Chat (LVChat) to enable multimodal large language models (LLMs) to read videos.
LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks.
arXiv Detail & Related papers (2024-02-19T11:59:14Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the need to model spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Retrieval-based Video Language Model for Efficient Long Video Question Answering [39.474247695753725]
We introduce a retrieval-based video language model (R-VLM) for efficient and interpretable long video QA.
Specifically, given a question (query) and a long video, our model identifies and selects the most relevant $K$ video chunks.
Our experimental results validate the effectiveness of our framework for comprehending long videos.
arXiv Detail & Related papers (2023-12-08T09:48:36Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.