Preacher: Paper-to-Video Agentic System
- URL: http://arxiv.org/abs/2508.09632v6
- Date: Mon, 08 Sep 2025 11:42:40 GMT
- Title: Preacher: Paper-to-Video Agentic System
- Authors: Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang
- Abstract summary: Preacher is the first paper-to-video agentic system. It decomposes, summarizes, and reformulates a research paper into a structured video abstract. It generates high-quality video abstracts across five research fields.
- Score: 58.34155339878016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/Gen-Verse/Paper2Video
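The top-down then bottom-up flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Scene`, `decompose`, `summarize`, `plan_scenes`, `render_segment`, `assemble`) is hypothetical, the "summary" is a plain word truncation standing in for an LLM call, and the "video segment" is a text placeholder standing in for a video generation model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    """A hypothetical 'key scene': a title plus the narration it should convey."""
    title: str
    narration: str


def decompose(paper: str) -> List[str]:
    # Top-down step 1 (assumption): treat blank-line-separated blocks as sections.
    return [block.strip() for block in paper.split("\n\n") if block.strip()]


def summarize(section: str, max_words: int = 12) -> str:
    # Top-down step 2 (assumption): truncate to max_words as a stand-in
    # for a real summarization model.
    return " ".join(section.split()[:max_words])


def plan_scenes(paper: str) -> List[Scene]:
    # Top-down step 3: reformulate each summarized section into a scene plan.
    return [
        Scene(title=f"Scene {i}", narration=summarize(section))
        for i, section in enumerate(decompose(paper), start=1)
    ]


def render_segment(scene: Scene) -> str:
    # Bottom-up step 1 (assumption): a text placeholder for a generated clip.
    return f"[{scene.title}] {scene.narration}"


def assemble(scenes: List[Scene]) -> List[str]:
    # Bottom-up step 2: combine per-scene segments, in order, into one abstract.
    return [render_segment(scene) for scene in scenes]


if __name__ == "__main__":
    toy_paper = "Intro text here.\n\nMethod text here."
    for segment in assemble(plan_scenes(toy_paper)):
        print(segment)
```

The sketch keeps planning (`plan_scenes`) separate from rendering (`assemble`) so that the scene plan could be revised iteratively before any segment is generated, which is the general idea the abstract attributes to its planning stage.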
Related papers
- Paper2Video: Automatic Video Generation from Scientific Papers [62.634562246594555]
Paper2Video is the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We propose PaperTalker, the first multi-agent framework for academic presentation video generation.
arXiv Detail & Related papers (2025-10-06T17:58:02Z)
- Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version) [18.484276267960436]
A promising solution is to augment the reasoning performance with multiple related videos. Video tokens are numerous and contain redundant information. We propose a multi-video collaborative framework for video language models.
arXiv Detail & Related papers (2025-09-16T15:13:21Z)
- REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing [56.992916488077476]
In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative.
arXiv Detail & Related papers (2025-05-24T21:36:49Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [74.61964363605632]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
- Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising [43.35391175319815]
This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos.
We introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models.
Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models.
arXiv Detail & Related papers (2023-05-29T17:38:18Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.