PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction
- URL: http://arxiv.org/abs/2508.13602v2
- Date: Sat, 30 Aug 2025 04:28:46 GMT
- Title: PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction
- Authors: Xiaolu Hou, Bing Ma, Jiaxiang Cheng, Xuhua Ren, Kai Yu, Wenyue Li, Tianxiang Zheng, Qinglin Lu
- Abstract summary: PersonaVlog is an automated multimodal stylized Vlog generation framework. We propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). We also propose ThemeVlogEval, a theme-based automated benchmarking framework.
- Score: 16.08952891878684
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high-quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self-correction of multimodal content. We also propose ThemeVlogEval, a theme-based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages of our framework over several baselines, highlighting its effectiveness and potential for automated Vlog generation.
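The pipeline described in the abstract (MLLM agents turn a theme and reference image into per-modality prompts, and a feedback-and-rollback loop iteratively self-corrects the result) can be illustrated with a minimal sketch. All interfaces below (`prompt_agent`, `critic_agent`, `revise`) are hypothetical stand-ins; the paper does not publish an API, and a real implementation would call MLLMs and video/music/speech generation backends.

```python
"""Minimal sketch of PersonaVlog-style multi-agent generation with
iterative self-correction. Agent interfaces are hypothetical stand-ins
for MLLM calls; the paper does not specify an API."""

from dataclasses import dataclass, field


@dataclass
class VlogDraft:
    video_prompt: str
    music_prompt: str
    monologue: str
    feedback: list = field(default_factory=list)


def prompt_agent(theme: str, reference_image: str) -> VlogDraft:
    # Placeholder for an MLLM call that expands the theme and reference
    # image into per-modality prompts (video, music, inner monologue).
    return VlogDraft(
        video_prompt=f"stylized vlog shots about {theme}, matching {reference_image}",
        music_prompt=f"background music evoking {theme}",
        monologue=f"first-person inner monologue reflecting on {theme}",
    )


def critic_agent(draft: VlogDraft) -> tuple[float, str]:
    # Placeholder for an MLLM evaluator that scores a generated result and
    # returns natural-language feedback. A real system would render the
    # draft and judge the actual video/music/speech outputs.
    score = 1.0 if "stylized" in draft.video_prompt else 0.4
    return score, "increase stylistic consistency with the reference image"


def revise(draft: VlogDraft, feedback: str) -> VlogDraft:
    # Placeholder revision step: fold the critic's feedback back into the
    # prompts before rolling back to a fresh generation attempt.
    draft.feedback.append(feedback)
    draft.video_prompt += f" ({feedback})"
    return draft


def generate_vlog(theme: str, reference_image: str,
                  threshold: float = 0.8, max_rounds: int = 3) -> VlogDraft:
    """Feedback-and-rollback loop: regenerate until the critic is satisfied."""
    draft = prompt_agent(theme, reference_image)
    for _ in range(max_rounds):
        score, feedback = critic_agent(draft)
        if score >= threshold:
            break
        draft = revise(draft, feedback)  # rollback + retry with feedback
    return draft


if __name__ == "__main__":
    result = generate_vlog("a rainy day in Kyoto", "ref_portrait.png")
    print(result.video_prompt)
```

The bounded loop reflects the rollback mechanism: a draft that fails the critic's threshold is regenerated with the textual feedback folded in, rather than accepted as-is.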
Related papers
- A Versatile Multimodal Agent for Multimedia Content Generation [66.86040734610073]
We propose a MultiMedia-Agent designed to automate complex content creation tasks. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment.
arXiv Detail & Related papers (2026-01-06T18:49:47Z)
- Kling-Omni Technical Report [80.64599716667777]
We present Kling-Omni, a generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks. It supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation.
arXiv Detail & Related papers (2025-12-18T17:08:12Z)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a "multimodal kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling [25.64589522035151]
MAViS is an end-to-end collaborative framework for long-sequence video storytelling. It orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, generation, video animation, and audio generation. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling.
arXiv Detail & Related papers (2025-08-11T21:42:41Z)
- EchoShot: Multi-Shot Portrait Video Generation [37.77879735014084]
EchoShot is a native multi-shot framework for portrait customization built upon a foundation video diffusion model. To facilitate model training within the multi-shot scenario, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset. To further enhance applicability, we extend EchoShot to perform reference image-based personalized multi-shot generation and long video synthesis with infinite shot counts.
arXiv Detail & Related papers (2025-06-16T11:00:16Z)
- CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance [34.345125922868]
We propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. Our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation.
arXiv Detail & Related papers (2025-03-13T14:07:58Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation [62.747751204215916]
We propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation.
PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs.
This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks.
arXiv Detail & Related papers (2024-10-17T17:59:57Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
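As a rough illustration of the retrieval-based pairing step described above, the sketch below pairs each text prompt with the most similar captioned example from a pool. The similarity measure (token overlap) and the data layout are illustrative assumptions, not VIMI's actual pipeline, which would use learned embeddings and nearest-neighbor search at scale.

```python
"""Toy sketch of retrieval-based prompt pairing in the spirit of VIMI's
dataset construction: for each text prompt, retrieve the most similar
captioned example to serve as a multimodal in-context example."""


def token_overlap(a: str, b: str) -> float:
    # Crude lexical similarity; a real pipeline would use a learned
    # text/image embedding model and approximate nearest neighbors.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))


def pair_prompts(prompts, retrieval_pool):
    """Pair each text prompt with its best-matching (caption, image) example."""
    paired = []
    for prompt in prompts:
        best = max(retrieval_pool,
                   key=lambda ex: token_overlap(prompt, ex["caption"]))
        paired.append({"prompt": prompt, "in_context": best})
    return paired


pool = [
    {"caption": "a dog surfing a wave at sunset", "image": "surf_dog.jpg"},
    {"caption": "time-lapse of city traffic at night", "image": "city.jpg"},
]
print(pair_prompts(["a golden retriever surfing at sunset"], pool))
```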
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic, personalized generative framework that can handle a wide range of personalization needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself.
Specifically, we propose context-aware reward modeling: detailed video descriptions are provided as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
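To make the context-aware reward modeling idea above concrete, here is a toy sketch in which a preference judge receives a detailed video description as grounding context before comparing two candidate responses. The prompt format and the lexical-overlap judge are illustrative assumptions standing in for a real multimodal judge model.

```python
"""Toy sketch of context-aware reward modeling as described for RLAIF:
the preference judge sees a detailed video description as context when
comparing two candidate responses."""


def build_judge_prompt(video_description: str, question: str,
                       response_a: str, response_b: str) -> str:
    # The detailed video description is injected as grounding context so
    # the AI judge can check responses against what happens on screen.
    return (
        f"Video description:\n{video_description}\n\n"
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is more faithful to the video? Answer A or B."
    )


def judge_preference(prompt: str) -> str:
    # Stand-in for a call to a multimodal judge model; this toy version
    # simply prefers the response sharing more words with the context.
    context = prompt.split("Question:")[0]
    resp_a = prompt.split("Response A: ")[1].split("\n")[0]
    resp_b = prompt.split("Response B: ")[1].split("\n")[0]

    def overlap(r: str) -> int:
        return len(set(r.lower().split()) & set(context.lower().split()))

    return "A" if overlap(resp_a) >= overlap(resp_b) else "B"


prompt = build_judge_prompt(
    "A chef dices onions, then sears salmon in a cast-iron pan.",
    "What is the chef cooking?",
    "The chef sears salmon after dicing onions.",
    "The chef is baking a chocolate cake.",
)
print(judge_preference(prompt))  # preference label used to train the reward model
```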