VideoAgent: Personalized Synthesis of Scientific Videos
- URL: http://arxiv.org/abs/2509.11253v1
- Date: Sun, 14 Sep 2025 12:54:21 GMT
- Title: VideoAgent: Personalized Synthesis of Scientific Videos
- Authors: Xiao Liang, Bangxin Li, Zixuan Chen, Hanyue Zheng, Zhi Ma, Di Wang, Cong Tian, Quan Wang,
- Abstract summary: VideoAgent is a novel multi-agent framework that synthesizes personalized scientific videos through a conversational interface. VideoAgent parses a source paper into a fine-grained asset library and orchestrates a narrative flow that synthesizes both static slides and dynamic animations to explain complex concepts. SciVidEval is the first comprehensive suite for this task, which combines automated metrics for multimodal content quality and synchronization with a Video-Quiz-based human evaluation to measure knowledge transfer.
- Score: 24.440349159498286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating the generation of scientific videos is a crucial yet challenging task for effective knowledge dissemination. However, existing works on document automation primarily focus on static media such as posters and slides, lacking mechanisms for personalized dynamic orchestration and multimodal content synchronization. To address these challenges, we introduce VideoAgent, a novel multi-agent framework that synthesizes personalized scientific videos through a conversational interface. VideoAgent parses a source paper into a fine-grained asset library and, guided by user requirements, orchestrates a narrative flow that synthesizes both static slides and dynamic animations to explain complex concepts. To enable rigorous evaluation, we also propose SciVidEval, the first comprehensive suite for this task, which combines automated metrics for multimodal content quality and synchronization with a Video-Quiz-based human evaluation to measure knowledge transfer. Extensive experiments demonstrate that our method significantly outperforms existing commercial scientific video generation services and approaches human-level quality in scientific communication.
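The abstract describes the pipeline only at a high level (source paper → fine-grained asset library → user-guided narrative plan → synchronized slides and animations). The sketch below is a minimal, hypothetical illustration of how such an agentic paper-to-video flow could be wired; it is not the authors' implementation, all names (Asset, Scene, parse_paper, plan_narrative, render_video) are invented for this example, and the LLM planning and rendering steps are replaced by toy stubs.

```python
# Hypothetical sketch of a VideoAgent-style orchestration loop.
# Names and heuristics are illustrative only; real planning/rendering
# would be delegated to LLM and rendering agents.
from dataclasses import dataclass


@dataclass
class Asset:
    kind: str      # e.g. "figure", "equation", "paragraph"
    content: str   # raw text or a path to an extracted image


@dataclass
class Scene:
    title: str
    assets: list[Asset]
    narration: str
    animated: bool  # static slide vs. dynamic animation


def parse_paper(sections: dict[str, str]) -> list[Asset]:
    """Split a source paper into a fine-grained asset library (stub: one text asset per section)."""
    return [Asset(kind="paragraph", content=f"{name}: {text}") for name, text in sections.items()]


def plan_narrative(assets: list[Asset], user_request: str) -> list[Scene]:
    """Order assets into a narrative flow; a real system would delegate this to an LLM planner."""
    scenes = []
    for asset in assets:
        animated = "method" in asset.content.lower()  # toy heuristic: animate methodological content
        scenes.append(Scene(
            title=asset.content.split(":", 1)[0],
            assets=[asset],
            narration=f"({user_request}) {asset.content[:80]}...",
            animated=animated,
        ))
    return scenes


def render_video(scenes: list[Scene]) -> list[str]:
    """Render each scene to a clip; here we just emit a textual storyboard."""
    return [f"[{'ANIM' if s.animated else 'SLIDE'}] {s.title} | narration: {s.narration}" for s in scenes]


if __name__ == "__main__":
    paper = {
        "Introduction": "Why scientific videos matter.",
        "Method": "A multi-agent pipeline with asset parsing and narrative planning.",
    }
    storyboard = render_video(plan_narrative(parse_paper(paper), "explain it to a new PhD student"))
    print("\n".join(storyboard))
```

In a full system, the planning and rendering stubs would be backed by the conversational agents described in the abstract, and the resulting video would then be scored with automated multimodal-quality and synchronization metrics plus a video-quiz human evaluation of knowledge transfer, as SciVidEval does.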
Related papers
- EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion. We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens. Our findings reveal that explicitly representing human motion alongside appearance significantly boosts the coherence and plausibility of human-centric video generation.
arXiv Detail & Related papers (2025-12-21T17:08:14Z)
- ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation [48.59900036213667]
Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image. We introduce ID-Composer, a novel framework that tackles multi-subject video generation from a text prompt and reference images.
arXiv Detail & Related papers (2025-11-01T11:29:14Z)
- Paper2Video: Automatic Video Generation from Scientific Papers [62.634562246594555]
Paper2Video is the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We propose PaperTalker, the first multi-agent framework for academic presentation video generation.
arXiv Detail & Related papers (2025-10-06T17:58:02Z)
- PresentAgent: Multimodal Agent for Presentation Video Generation [30.274831875701217]
We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. PresentAgent employs a modular pipeline that segments the input document, then plans and renders slide-style visual frames. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models.
arXiv Detail & Related papers (2025-07-05T13:24:15Z)
- Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward [8.470241117250243]
This paper focuses on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation. It offers insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. This research underscores the transformative potential of text-to-motion GenAI and LLM architectures in applications such as healthcare, humanoids, gaming, animation, and assistive technologies.
arXiv Detail & Related papers (2025-05-31T11:02:24Z)
- MAGREF: Masked Guidance for Any-Reference Video Generation [33.35245169242822]
MAGREF is a unified framework for any-reference video generation. We propose a region-aware dynamic masking mechanism that enables a single model to flexibly handle inference over varied subjects. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios.
arXiv Detail & Related papers (2025-05-29T17:58:15Z)
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos. We advocate for the incorporation of a retrieval mechanism during the generation phase. Our pipeline is designed to apply to any text-to-video diffusion model.
arXiv Detail & Related papers (2025-04-09T08:14:05Z)
- Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand existing human video datasets with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
- Llama Learns to Direct: DirectorLLM for Human-Centric Video Generation [54.561971554162376]
We introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. Our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.
arXiv Detail & Related papers (2024-12-19T03:10:26Z)
- SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input [6.275971782566314]
We introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer. To address the scarcity of stereo video data, we propose a depth-based video generation module, DVG. We also propose RefinerNet, together with a self-supervised synthetic framework, designed to facilitate efficient and dedicated training.
arXiv Detail & Related papers (2024-11-18T15:12:59Z)
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of video with the semantic meaning of the article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.