CineLOG: A Training Free Approach for Cinematic Long Video Generation
- URL: http://arxiv.org/abs/2512.12209v1
- Date: Sat, 13 Dec 2025 06:44:09 GMT
- Title: CineLOG: A Training Free Approach for Cinematic Long Video Generation
- Authors: Zahra Dehghanian, Morteza Abolghasemi, Hamid Beigy, Hamid R. Rabiee
- Abstract summary: We introduce CineLOG, a new dataset of 5,000 high-quality, balanced video clips. Each entry is annotated with a detailed scene description and explicit camera instructions based on a standard cinematic taxonomy. We present our novel pipeline designed to create this dataset, which decomposes the complex text-to-video (T2V) generation task into four easier stages with more mature technology.
- Score: 19.97092710696699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controllable video synthesis is a central challenge in computer vision, yet current models struggle with fine-grained control beyond textual prompts, particularly for cinematic attributes like camera trajectory and genre. Existing datasets often suffer from severe data imbalance, noisy labels, or a significant simulation-to-real gap. To address this, we introduce CineLOG, a new dataset of 5,000 high-quality, balanced, and uncut video clips. Each entry is annotated with a detailed scene description, explicit camera instructions based on a standard cinematic taxonomy, and a genre label, ensuring balanced coverage across 17 diverse camera movements and 15 film genres. We also present our novel pipeline designed to create this dataset, which decomposes the complex text-to-video (T2V) generation task into four easier stages with more mature technology. To enable coherent, multi-shot sequences, we introduce a novel Trajectory Guided Transition Module that generates smooth spatio-temporal interpolations. Extensive human evaluations show that our pipeline significantly outperforms SOTA end-to-end T2V models in adhering to specific camera and screenplay instructions, while maintaining professional visual quality. All code and data are available at https://cine-log.pages.dev.
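The abstract leaves the Trajectory Guided Transition Module unspecified; one plausible picture is interpolating between the latents of two shot boundaries along an eased trajectory. Below is a minimal sketch under that assumption: every function name, the slerp choice, and the smoothstep easing are ours, not the authors'.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent frames (a common choice
    for diffusion latents; an assumption here, not taken from the paper)."""
    z0f, z1f = z0.ravel(), z1.ravel()
    cos = np.clip(np.dot(z0f, z1f) / (np.linalg.norm(z0f) * np.linalg.norm(z1f)), -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)

def ease_in_out(t):
    """Smoothstep easing so the virtual camera accelerates and decelerates."""
    return t * t * (3 - 2 * t)

def transition_latents(last_shot_latent, next_shot_latent, n_frames=16):
    """Generate in-between latents for a smooth shot-to-shot transition."""
    ts = [ease_in_out(i / (n_frames - 1)) for i in range(n_frames)]
    return np.stack([slerp(last_shot_latent, next_shot_latent, t) for t in ts])

# Toy usage with random "latents" standing in for encoded boundary frames.
a = np.random.randn(4, 32, 32).astype(np.float32)
b = np.random.randn(4, 32, 32).astype(np.float32)
frames = transition_latents(a, b, n_frames=16)
print(frames.shape)  # (16, 4, 32, 32)
```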
Related papers
- CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation [65.03946626081036]
We present CineScene, a framework that leverages an implicit 3D-aware scene representation for cinematic video generation. CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation.
arXiv Detail & Related papers (2026-02-06T18:59:24Z)
- CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models [28.224969852134606]
We introduce CineTrans, a framework for generating coherent multi-shot videos with cinematic, film-style transitions. CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations.
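Reading "masked diffusion" for transitions as frame-level inpainting gives a useful mental model: boundary frames from the adjoining shots are held fixed while masked in-between frames are denoised. The following is a schematic sketch, not CineTrans itself (a faithful version would re-noise the context frames to the current step):

```python
import torch

def make_transition_mask(n_frames: int, n_context: int = 4) -> torch.Tensor:
    """1 = frame is generated, 0 = frame is kept from the source shots."""
    mask = torch.ones(n_frames)
    mask[:n_context] = 0.0   # tail frames of the previous shot
    mask[-n_context:] = 0.0  # head frames of the next shot
    return mask

@torch.no_grad()
def masked_denoise(denoiser, video, mask, n_steps=50):
    """Inpainting-style sampling: re-impose known frames at every step."""
    x = torch.randn_like(video)
    m = mask.view(-1, 1, 1, 1)  # broadcast over (C, H, W)
    for step in reversed(range(n_steps)):
        t = torch.tensor([step])
        x = denoiser(x, t)           # one reverse-diffusion step
        x = m * x + (1 - m) * video  # clamp the unmasked context frames
    return x

# Toy denoiser so the sketch runs end to end.
dummy = lambda x, t: x - 0.01 * x
video = torch.randn(16, 3, 64, 64)   # (frames, C, H, W)
out = masked_denoise(dummy, video, make_transition_mask(16))
print(out.shape)
```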
arXiv Detail & Related papers (2025-08-15T13:58:22Z)
- OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions [77.04071342405055]
We develop Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing of the subject in the customized video. We also propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms: Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). Our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations.
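The abstract names Lottery Embedding and Temporally Aligned Embedding without defining them. Purely as illustration, "temporally aligned" could mean adding per-frame condition embeddings to the video tokens that share the same frame index; the sketch below is a guess at that reading, not OmniVCus's mechanism:

```python
import torch
import torch.nn as nn

class TemporallyAlignedEmbedding(nn.Module):
    """Adds per-frame condition embeddings to video tokens at the same
    temporal index. A hypothetical reading of 'temporally aligned';
    OmniVCus's actual TAE may differ."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_tokens, cond_tokens):
        # video_tokens: (B, T, N, D); cond_tokens: (B, T, D)
        return video_tokens + self.proj(cond_tokens).unsqueeze(2)

tae = TemporallyAlignedEmbedding(64)
v = torch.randn(2, 8, 16, 64)   # 8 frames, 16 spatial tokens each
c = torch.randn(2, 8, 64)       # one condition token per frame
print(tae(v, c).shape)          # torch.Size([2, 8, 16, 64])
```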
arXiv Detail & Related papers (2025-06-29T18:43:00Z)
- CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition [23.795982778641573]
We present CineVerse, a novel framework for the task of cinematic scene composition. Similar to traditional multi-shot generation, our task emphasizes the need for consistency and continuity across frames. Our task also focuses on addressing challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects.
arXiv Detail & Related papers (2025-04-28T15:28:14Z)
- SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
We propose SkyReels-V2, an infinite-length film generative model that synergizes a Multi-modal Large Language Model (MLLM), multi-stage pretraining, reinforcement learning, and a Diffusion Forcing framework. We establish progressive-resolution pretraining for fundamental video generation, followed by a four-stage post-training enhancement.
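SkyReels-V2 builds on Diffusion Forcing, whose core idea is to give each frame its own noise level so a sliding window can be denoised while new fully-noised frames keep entering, which is what enables unbounded length. A toy sketch of such a per-frame schedule (the offset and step count are made up; the paper's actual schedule is not in the abstract):

```python
import torch

def per_frame_noise_levels(n_frames: int, n_steps: int, offset: int) -> torch.Tensor:
    """Diffusion-Forcing-style schedule: each frame gets its own timestep,
    later frames noisier than earlier ones, so the window can be rolled
    forward autoregressively for arbitrarily long videos."""
    t = torch.arange(n_frames) * offset
    return t.clamp(max=n_steps - 1)

levels = per_frame_noise_levels(n_frames=16, n_steps=1000, offset=60)
print(levels)  # frame 0 nearly clean, frame 15 close to pure noise
```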
arXiv Detail & Related papers (2025-04-17T16:37:27Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
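The abstract describes VGoT as a training-free, step-by-step pipeline, and a skeletal orchestration loop conveys the shape of such a system. Every function below is a hypothetical stub for illustration, not VGoT's code:

```python
from dataclasses import dataclass

@dataclass
class Shot:
    prompt: str
    frames: list

def plan_shots(story: str, n_shots: int = 3):
    """Stand-in for LLM-based shot planning (hypothetical stub)."""
    return [f"{story} -- shot {i + 1} of {n_shots}" for i in range(n_shots)]

def generate_shot(prompt: str, identity_ref, prev_tail):
    """Stand-in for the per-shot video generator; conditioning on a shared
    identity reference and the previous shot's tail frames is this sketch's
    guess at how visual inconsistency and transition artifacts are fought."""
    return [f"frame({prompt})"] * 8

def vgot_like_pipeline(story: str):
    identity_ref = None   # e.g. a face/style embedding shared across shots
    shots, prev_tail = [], None
    for prompt in plan_shots(story):
        frames = generate_shot(prompt, identity_ref, prev_tail)
        prev_tail = frames[-2:]
        shots.append(Shot(prompt, frames))
    return shots

for s in vgot_like_pipeline("A sailor returns home after a long voyage"):
    print(s.prompt, len(s.frames))
```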
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation [60.07447565026327]
We propose DreamRunner, a novel story-to-video generation method. We structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos.
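Retrieval-augmented test-time adaptation presumably begins by fetching reference clips whose motion matches the planned prompt; here is a generic nearest-neighbour retrieval sketch with toy embeddings, not DreamRunner's actual retriever:

```python
import numpy as np

def cosine_top_k(query, library, k=3):
    """Return the k motion clips whose embeddings best match the query."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    scores = lib @ q
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

rng = np.random.default_rng(0)
motion_library = rng.normal(size=(100, 512))   # embeddings of reference clips
query = rng.normal(size=512)                   # embedding of e.g. "a dog leaps"
idx, scores = cosine_top_k(query, motion_library)
print(idx, scores)  # clips whose motion priors would be adapted at test time
```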
arXiv Detail & Related papers (2024-11-25T18:41:56Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
The DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
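Factorized spatial and temporal self-attention is a standard construction, and a compact PyTorch block shows what the abstract's "spatial and temporal self-attention layers" typically look like. Dimensions and layer choices are illustrative, not xGen-VideoSyn-1's; each frame contributes 16 tokens, standing in for a VidVAE-compressed latent:

```python
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Spatial attention over tokens within a frame, then temporal attention
    across frames at each spatial location."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):              # x: (B, T, N, D) video tokens
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)     # attend within each frame
        h = self.norm1(s)
        s = s + self.spatial(h, h, h)[0]
        v = s.reshape(b, t, n, d).transpose(1, 2).reshape(b * n, t, d)
        h = self.norm2(v)              # attend across time per location
        v = v + self.temporal(h, h, h)[0]
        return v.reshape(b, n, t, d).transpose(1, 2)

x = torch.randn(1, 8, 16, 64)  # 8 frames of 16 compressed tokens each
print(FactorizedSTBlock()(x).shape)  # torch.Size([1, 8, 16, 64])
```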
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data [14.489919164476982]
High-quality (HQ) video synthesis is challenging because of the diverse and complex motions that exist in the real world.
Most existing works address this problem by collecting large-scale captioned HQ videos, which are inaccessible to the community.
We show that publicly available limited and low-quality (LQ) data are sufficient to train a HQ video generator without recaptioning or finetuning.
arXiv Detail & Related papers (2024-08-19T16:08:00Z)
- Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences.
We introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration.
Based on our new benchmark, we evaluate a range of large vision-language models as baselines, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation.
arXiv Detail & Related papers (2024-04-20T13:15:27Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents SEINE, a short-to-long video diffusion model that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
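A random-mask scheme of the kind SEINE describes can be sketched as sampling which frames are given versus generated: a mask that leaves both ends known yields a transition, while one that leaves only a prefix known yields prediction. The masking distribution below is an assumption for illustration:

```python
import torch

def random_training_mask(n_frames: int) -> torch.Tensor:
    """Random frame mask in the spirit of random-mask training:
    0 = frame given as condition, 1 = frame to be generated."""
    mask = torch.ones(n_frames)
    if torch.rand(1).item() < 0.5:            # transition: both ends known
        mask[:2], mask[-2:] = 0.0, 0.0
    else:                                     # prediction: only a prefix known
        k = torch.randint(1, n_frames // 2, (1,)).item()
        mask[:k] = 0.0
    return mask

print(random_training_mask(16))
```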
arXiv Detail & Related papers (2023-10-31T17:58:17Z)