Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis
- URL: http://arxiv.org/abs/2507.13285v1
- Date: Thu, 17 Jul 2025 16:50:07 GMT
- Title: Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis
- Authors: Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao
- Abstract summary: We introduce RCPS, a novel framework for automated generation of high-quality media presentations. We also propose PREVAL, a preference-based evaluation framework to assess presentation quality across Content, Coherence, and Design. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.
- Score: 2.846897538377738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference-based evaluation framework employing rationale-enhanced multi-dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.
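The abstract provides no code, but the interplay between the Iterative Optimization Loop and PREVAL-style multi-dimensional scoring can be illustrated with a minimal sketch. Everything below (the `Scores` container, `preval_score`, `revise`, the stopping threshold) is a hypothetical interface assumed for illustration, not the authors' API:

```python
from dataclasses import dataclass

@dataclass
class Scores:
    content: float    # factual/content quality
    coherence: float  # narrative flow across slides
    design: float     # layout and visual quality

    def mean(self) -> float:
        return (self.content + self.coherence + self.design) / 3.0


def preval_score(presentation: str) -> Scores:
    """Placeholder for a PREVAL-style preference model that rates a
    presentation along Content, Coherence, and Design."""
    raise NotImplementedError  # would call a learned evaluator


def revise(presentation: str, scores: Scores) -> str:
    """Placeholder for a reflection step that rewrites the weakest
    dimension (e.g., re-plans the narrative if coherence is lowest)."""
    raise NotImplementedError


def iterative_optimization(draft: str, max_rounds: int = 5,
                           target: float = 0.9) -> str:
    """Score the draft, stop when it clears the target, otherwise revise."""
    current = draft
    for _ in range(max_rounds):
        scores = preval_score(current)
        if scores.mean() >= target:
            break
        current = revise(current, scores)
    return current
```

The design point the abstract implies is that revision is driven by the same Content/Coherence/Design signals used for final evaluation, so the loop optimizes directly for the dimensions being measured.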
Related papers
- Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity [78.7107376451476]
Hi3DEval is a hierarchical evaluation framework tailored for 3D generative content. We extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism. We propose a 3D-aware automated scoring system based on hybrid 3D representations.
arXiv Detail & Related papers (2025-08-07T17:50:13Z)
- Creativity in LLM-based Multi-Agent Systems: A Survey [56.25583236738877]
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks.
arXiv Detail & Related papers (2025-05-27T12:36:14Z)
- Unified Reward Model for Multimodal Understanding and Generation [32.22714522329413]
This paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. We first develop UnifiedReward on our constructed large-scale human preference dataset, including both image and video generation/understanding tasks.
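The abstract mentions training on a large-scale human preference dataset. A standard way to fit a reward model to such pairwise preferences is a Bradley-Terry loss; the PyTorch sketch below shows that generic recipe, which is an assumption here rather than UnifiedReward's confirmed objective:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen: torch.Tensor,
                             r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward of the human-preferred
    sample above the rejected one. A generic recipe, not necessarily
    UnifiedReward's exact objective."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage: r_chosen / r_rejected are reward-head outputs for the preferred
# and dispreferred image/video responses in a preference pair.
loss = pairwise_preference_loss(torch.randn(8), torch.randn(8))
```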
arXiv Detail & Related papers (2025-03-07T08:36:05Z)
- SHAPE: Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner [35.843587407696006]
Large Visual Language Models (LVLMs) increasingly rely on preference alignment to ensure reliability. We present SHAPE, a self-supervised framework capable of transforming the already abundant supervised text-image pairs into holistic preference triplets.
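As a rough illustration of turning supervised text-image pairs into preference triplets, one can sample several candidate responses per prompt, rank them holistically, and keep the extremes as winner and loser. The sketch below is only a guess at the general shape of such a pipeline; `generate` and `rank` are hypothetical callables, not SHAPE's components:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferenceTriplet:
    prompt: str   # the text side of the supervised pair
    winner: str   # holistically better response
    loser: str    # weaker response

def pairs_to_triplets(pairs, generate: Callable[[str], list[str]],
                      rank: Callable[[str, list[str]], list[str]]):
    """Illustrative conversion of supervised (prompt, reference) pairs into
    preference triplets: sample candidates, rank them holistically, keep the
    best/worst as winner/loser. Not SHAPE's actual pipeline."""
    triplets = []
    for prompt, _reference in pairs:
        candidates = generate(prompt)       # e.g., multiple model samples
        ordered = rank(prompt, candidates)  # best ... worst
        triplets.append(PreferenceTriplet(prompt, ordered[0], ordered[-1]))
    return triplets
```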
arXiv Detail & Related papers (2025-03-06T08:33:11Z)
- M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework that enables more human-aligned, holistic evaluation of AI-generated images. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores.
arXiv Detail & Related papers (2025-02-21T03:05:45Z)
- CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation [22.139826276559724]
Key components such as human preference alignment play a crucial role in ensuring generation quality. We introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions. We observe that CHATS exhibits exceptional data efficiency, achieving strong performance with only a small, high-quality fine-tuning dataset.
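The abstract only states that the preferred and dispreferred distributions are modeled separately. One plausible way to combine two such models at sampling time, in the spirit of classifier-free guidance, is to steer the denoising direction away from the dispreferred prediction; the sketch below is illustrative, not CHATS's confirmed rule:

```python
import torch

def contrastive_guidance(eps_pref: torch.Tensor,
                         eps_dispref: torch.Tensor,
                         w: float = 2.0) -> torch.Tensor:
    """Combine separately modeled preferred/dispreferred noise predictions
    by pushing the sample away from the dispreferred direction, analogous
    to classifier-free guidance. Illustrative assumption only."""
    return eps_pref + w * (eps_pref - eps_dispref)

# Usage inside a diffusion sampling step (eps_* are the two models' noise
# predictions for the same latent and prompt):
eps = contrastive_guidance(torch.randn(1, 4, 64, 64),
                           torch.randn(1, 4, 64, 64))
```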
arXiv Detail & Related papers (2025-02-18T06:31:08Z)
- PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides [51.88536367177796]
We propose a two-stage, edit-based approach inspired by human drafts for automatically generating presentations. PPTAgent first analyzes references to extract slide-level functional types and content schemas, then generates editing actions based on selected reference slides. PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
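The two-stage pipeline described above (analyze references, then emit editing actions over selected reference slides) can be sketched as follows. All names (`EditAction`, `analyze`, `select`, `edit_llm`) are hypothetical placeholders, not PPTAgent's actual interface:

```python
from dataclasses import dataclass

@dataclass
class EditAction:
    slide_id: int
    op: str           # e.g., "replace_text", "swap_image"
    target: str       # element identifier on the slide
    payload: str = "" # new content for the element

def generate_presentation(reference_deck, document,
                          analyze, select, edit_llm):
    """Hypothetical skeleton of an edit-based two-stage pipeline.
    Stage 1: extract slide-level functional types and content schemas
    from a reference deck. Stage 2: draft edit actions that adapt
    selected reference slides to the new document."""
    schemas = analyze(reference_deck)        # stage 1
    actions: list[EditAction] = []
    for section in document.sections:
        slide = select(schemas, section)     # pick a reference slide to adapt
        actions += edit_llm(slide, section)  # stage 2: propose edit actions
    return actions
```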
arXiv Detail & Related papers (2025-01-07T16:53:01Z)
- Can foundation models actively gather information in interactive environments to test hypotheses? [56.651636971591536]
We introduce a framework in which a model must determine the factors influencing a hidden reward function. We investigate whether approaches such as self-correction and increased inference time improve information gathering efficiency.
arXiv Detail & Related papers (2024-12-09T12:27:21Z)
- HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction [24.46493675079128]
OCR-dependent methods rely on offline OCR engines, while OCR-free methods might produce outputs that lack interpretability or contain hallucinated content.
We propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task.
Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities.
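A loose illustration of that encode/decode view, using a hypothetical data structure that is not the paper's actual representation:

```python
from dataclasses import dataclass

@dataclass
class HierarchicalEntity:
    """Illustrative reading of 'entities as hierarchical points': an entity
    is summarized by nested point sets that can be decoded into a text
    transcript, region centers, and a category."""
    category: str                             # entity type, e.g., "total_amount"
    entity_center: tuple[float, float]        # top-level point for the entity
    word_centers: list[tuple[float, float]]   # finer-grained points per word
    transcript: str                           # decoded text content
```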
arXiv Detail & Related papers (2024-11-02T05:00:13Z)
- Towards Fine-grained Human Pose Transfer with Detail Replenishing Network [96.54367984986898]
Human pose transfer (HPT) is an emerging research topic with huge potential in fashion design, media production, online advertising and virtual reality.
Existing HPT methods often suffer from three fundamental issues: detail deficiency, content ambiguity and style inconsistency.
We develop a more challenging yet practical HPT setting, termed Fine-grained Human Pose Transfer (FHPT), with a higher focus on semantic fidelity and detail replenishment.
arXiv Detail & Related papers (2020-05-26T03:05:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.