PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
- URL: http://arxiv.org/abs/2501.03936v3
- Date: Fri, 21 Feb 2025 07:52:39 GMT
- Title: PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
- Authors: Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
- Abstract summary: We propose a two-stage, edit-based approach inspired by human workflows for automatically generating presentations. PPTAgent first analyzes references to extract slide-level functional types and content schemas, then generates editing actions based on selected reference slides. PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
- Score: 51.88536367177796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically generating presentations from documents is a challenging task that requires accommodating content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
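The abstract outlines a concrete two-stage pipeline: a reference-analysis stage that extracts slide-level functional types and content schemas, and a generation stage that drafts an outline and applies editing actions to selected reference slides. Below is a minimal Python sketch of how such an edit-based flow could be organized; every name in it (Slide, analyze_reference, draft_outline, propose_edits, apply_edits) and the stubbed LLM helpers are illustrative assumptions, not PPTAgent's actual API.

```python
# Illustrative sketch of a two-stage, edit-based slide-generation flow as
# described in the abstract. All names and the stubbed "LLM" helpers are
# hypothetical; the real PPTAgent implementation may differ substantially.
from dataclasses import dataclass, field

@dataclass
class Slide:
    functional_type: str                  # e.g. "opening", "bullets", "ending"
    content_schema: dict                  # fields the slide layout expects
    elements: list = field(default_factory=list)

# Stage 1: analyze the reference presentation.
def analyze_reference(reference: list[Slide]) -> dict[str, list[Slide]]:
    """Group reference slides by functional type, keeping their content schemas."""
    groups: dict[str, list[Slide]] = {}
    for slide in reference:
        groups.setdefault(slide.functional_type, []).append(slide)
    return groups

# Stage 2: draft an outline, then edit a selected reference slide per item.
def draft_outline(document: str) -> list[dict]:
    """Stub for an LLM call mapping the source document to slide-level items."""
    return [{"functional_type": "opening", "title": document[:40]}]

def propose_edits(template: Slide, item: dict) -> list[tuple]:
    """Stub for an LLM call emitting editing actions against a reference slide."""
    return [("replace", "title", item.get("title", ""))]

def apply_edits(template: Slide, edits: list[tuple]) -> Slide:
    """Apply (op, field, value) actions to a copy of the reference slide."""
    new = Slide(template.functional_type, dict(template.content_schema),
                list(template.elements))
    for op, fld, value in edits:
        if op == "replace":
            new.content_schema[fld] = value
    return new

def generate(document: str, reference: list[Slide]) -> list[Slide]:
    groups = analyze_reference(reference)
    slides = []
    for item in draft_outline(document):
        # Select a reference slide of the right type; fall back to a blank one.
        candidates = groups.get(item["functional_type"]) or [Slide(item["functional_type"], {})]
        template = candidates[0]
        slides.append(apply_edits(template, propose_edits(template, item)))
    return slides
```

The edit-based design is the salient point: rather than rendering slides from scratch, each new slide starts from an existing, well-formed reference slide whose content is rewritten, which is what lets the approach preserve visual design while replacing content.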
Related papers
- Visual Consensus Prompting for Co-Salient Object Detection [26.820772908765083]
We propose an interaction-effective and parameter-efficient concise architecture for the co-salient object detection task.
We introduce a parameter-efficient prompt tuning paradigm that seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP).
Our VCP outperforms 13 cutting-edge full fine-tuning models, achieving a new state of the art (with a 6.8% improvement in the F_m metric on the most challenging CoCA dataset).
arXiv Detail & Related papers (2025-04-19T10:12:39Z) - Generative Compositor for Few-Shot Visual Information Extraction [60.663887314625164]
We propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE.
The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text.
The proposed method achieves highly competitive results with full-sample training, while notably outperforming the baseline in the 1-shot, 5-shot, and 10-shot settings.
arXiv Detail & Related papers (2025-03-21T04:56:24Z) - Textual-to-Visual Iterative Self-Verification for Slide Generation [46.99825956909532]
We decompose the task of generating missing presentation slides into two key components: content generation and layout generation.
Our approach significantly outperforms baseline methods in terms of alignment, logical flow, visual appeal, and readability.
arXiv Detail & Related papers (2025-02-21T12:21:09Z) - HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction [24.46493675079128]
OCR-dependent methods rely on offline OCR engines, while OCR-free methods might produce outputs that lack interpretability or contain hallucinated content.
We propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task.
Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities.
arXiv Detail & Related papers (2024-11-02T05:00:13Z) - IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z) - Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness [3.2925222641796554]
"pointer-guided segment ordering" (SO) is a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations.
Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures.
arXiv Detail & Related papers (2024-06-06T15:17:51Z) - LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification [5.8754760054410955]
We introduce Hi-CoDecomposition, a novel framework designed to enhance model interpretability through structured concept analysis.
Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process.
arXiv Detail & Related papers (2024-05-29T00:36:56Z) - Point-In-Context: Understanding Point Cloud via In-Context Learning [67.20277182808992]
We introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning.
We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module.
We propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S).
arXiv Detail & Related papers (2024-04-18T17:32:32Z) - CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z) - MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text
Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating the few-shot text classification task as a text pair relevance estimation task.
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
arXiv Detail & Related papers (2023-06-15T06:51:35Z) - TaCo: Textual Attribute Recognition via Contrastive Learning [9.042957048594825]
TaCo is a contrastive framework for textual attribute recognition tailored toward the most common document scenes.
We design the learning paradigm from three perspectives: 1) generating attribute views, 2) extracting subtle but crucial details, and 3) exploiting valued view pairs for learning.
Experiments show that TaCo surpasses the supervised counterparts and advances the state-of-the-art remarkably on multiple attribute recognition tasks.
arXiv Detail & Related papers (2022-08-22T09:45:34Z) - Weakly Supervised Concept Map Generation through Task-Guided Graph
Translation [9.203403318435486]
GT-D2G is an automatic concept map generation framework that leverages generalized NLP pipelines to derive semantic-rich initial graphs.
The quality and interpretability of such concept maps are validated through human evaluation on three real-world corpora.
arXiv Detail & Related papers (2021-10-08T20:17:10Z) - Summary Explorer: Visualizing the State of the Art in Text Summarization [23.45323725326221]
This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems.
The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulness, and position bias) encapsulated in a guided assessment based on tailored visualizations.
arXiv Detail & Related papers (2021-08-04T07:11:19Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - DOC2PPT: Automatic Presentation Slides Generation from Scientific
Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.