Facilitating Video Story Interaction with Multi-Agent Collaborative System
- URL: http://arxiv.org/abs/2505.03807v1
- Date: Fri, 02 May 2025 09:08:13 GMT
- Title: Facilitating Video Story Interaction with Multi-Agent Collaborative System
- Authors: Yiwen Zhang, Jianing Hao, Zhan Wang, Hongling Sheng, Wei Zeng
- Abstract summary: Our system uses a Vision Language Model (VLM) to enable machines to understand video stories. It combines Retrieval-Augmented Generation (RAG) and a Multi-Agent System (MAS) to create evolving characters and scene experiences.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video story interaction enables viewers to engage with and explore narrative content for personalized experiences. However, existing methods are limited to user selection, specially designed narratives, and lack customization. To address this, we propose an interactive system based on user intent. Our system uses a Vision Language Model (VLM) to enable machines to understand video stories, combining Retrieval-Augmented Generation (RAG) and a Multi-Agent System (MAS) to create evolving characters and scene experiences. It includes three stages: 1) Video story processing, utilizing VLM and prior knowledge to simulate human understanding of stories across three modalities. 2) Multi-space chat, creating growth-oriented characters through MAS interactions based on user queries and story stages. 3) Scene customization, expanding and visualizing various story scenes mentioned in dialogue. Applied to the Harry Potter series, our study shows the system effectively portrays emergent character social behavior and growth, enhancing the interactive experience in the video story world.
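The three stages described in the abstract can be illustrated with a minimal sketch. This is a hypothetical skeleton, not the authors' implementation: the class names, the keyword-match retrieval (standing in for embedding-based RAG over VLM outputs), and the placeholder reply string are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class StoryEvent:
    """Stage 1 output: a story unit extracted from the video (here, plain text)."""
    stage: str          # narrative stage the event belongs to, e.g. "1"
    description: str    # VLM-produced description of the scene

@dataclass
class CharacterAgent:
    """Stage 2: a character agent whose knowledge is bounded by the story stage."""
    name: str
    memory: list = field(default_factory=list)

    def retrieve(self, events, query, stage):
        # Toy RAG step: keyword match over events up to the current stage.
        # A real system would use embedding similarity over the VLM outputs.
        words = query.lower().split()
        return [e for e in events
                if e.stage <= stage
                and any(w in e.description.lower() for w in words)]

    def reply(self, events, query, stage):
        context = self.retrieve(events, query, stage)
        self.memory.append(query)  # the agent "grows" by accumulating dialogue
        # Placeholder for an LLM call conditioned on the retrieved context.
        return f"{self.name} recalls {len(context)} relevant event(s)."

def customize_scene(events, mentioned):
    """Stage 3 stub: select events matching a scene mentioned in dialogue."""
    return [e.description for e in events
            if mentioned.lower() in e.description.lower()]
```

Gating retrieval on the story stage is what keeps a character's knowledge consistent with where the viewer is in the narrative, so the agent cannot "spoil" later events.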
Related papers
- TaleForge: Interactive Multimodal System for Personalized Story Creation [15.193340794653261]
TaleForge is a personalized story-generation system that embeds users' facial images within both narratives and illustrations. A user study demonstrated heightened engagement and ownership when individuals appeared as protagonists.
arXiv Detail & Related papers (2025-06-27T00:45:38Z) - StorySage: Conversational Autobiography Writing Powered by a Multi-Agent Framework [40.06696963935616]
StorySage is a user-driven software system designed to meet the needs of a diverse group of users. Our system iteratively collects user memories, updates their autobiography, and plans for future conversations.
arXiv Detail & Related papers (2025-06-17T03:44:47Z) - Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts [20.281732318265483]
We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. We render each utterance as expressive, character-conditioned speech, resulting in fully-voiced, multimodal video narratives.
arXiv Detail & Related papers (2025-05-22T15:54:42Z) - From Panels to Prose: Generating Literary Narratives from Comics [55.544015596503726]
We develop an automated system that generates text-based literary narratives from manga comics. Our approach aims to create an evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters.
arXiv Detail & Related papers (2025-03-30T07:18:10Z) - Towards Enhanced Immersion and Agency for LLM-based Interactive Drama [55.770617779283064]
This paper begins with understanding interactive drama from two aspects: Immersion, the player's feeling of being present in the story, and Agency. To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality.
arXiv Detail & Related papers (2025-02-25T06:06:16Z) - SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation [38.96874874208242]
We introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy. Specifically, we employ Large Language Models with Retrieval-Augmented Generation to generate coherent and diverse long-form scripts. A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues.
arXiv Detail & Related papers (2024-11-29T18:36:15Z) - DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation [60.07447565026327]
We propose DreamRunner, a novel story-to-video generation method. We structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos.
arXiv Detail & Related papers (2024-11-25T18:41:56Z) - StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - A Character-Centric Creative Story Generation via Imagination [15.345466372805516]
We introduce a novel story generation framework called CCI (Character-centric Creative story generation via Imagination). CCI features two modules for creative story generation: IG (Image-Guided Imagination) and MW (Multi-Writer model). In the IG module, we utilize a text-to-image model to create visual representations of key story elements, such as characters, backgrounds, and main plots. The MW module uses these story elements to generate multiple persona-description candidates and selects the best one to insert into the story, thereby enhancing the richness and depth of the narrative.
arXiv Detail & Related papers (2024-09-25T06:54:29Z) - Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models [57.30913211264333]
We present Story3D-Agent, a pioneering approach that transforms provided narratives into 3D-rendered visualizations.
By integrating procedural modeling, our approach enables precise control over multi-character actions and motions, as well as diverse decorative elements.
We have thoroughly evaluated our Story3D-Agent to validate its effectiveness, offering a basic framework to advance 3D story representation.
arXiv Detail & Related papers (2024-08-21T17:43:15Z) - SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 during training) in a highly efficient autoregressive manner.
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z) - NarrativePlay: Interactive Narrative Understanding [27.440721435864194]
We introduce NarrativePlay, a novel system that allows users to role-play a fictional character and interact with other characters in narratives in an immersive environment.
We leverage Large Language Models (LLMs) to generate human-like responses, guided by personality traits extracted from narratives.
NarrativePlay has been evaluated on two types of narratives, detective and adventure stories, where users can either explore the world or improve their favorability with the narrative characters through conversations.
arXiv Detail & Related papers (2023-10-02T13:24:00Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.