AutoPresent: Designing Structured Visuals from Scratch
- URL: http://arxiv.org/abs/2501.00912v1
- Date: Wed, 01 Jan 2025 18:09:32 GMT
- Title: AutoPresent: Designing Structured Visuals from Scratch
- Authors: Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell,
- Abstract summary: We benchmark end-to-end image generation and program generation methods with a variety of models. We create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation.
- Score: 99.766901203884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce SlidesBench, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, measuring similarity to a target slide, and (ii) reference-free, measuring the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k pairs of instructions and code for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked to self-refine its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.
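To make the reference-based evaluation idea concrete, here is a toy sketch of how a generated slide might be scored against a target slide. This is an illustrative simplification, not SlidesBench's actual metric: it represents each slide as a list of `(text, bbox)` elements (both hypothetical structures chosen here), greedily matches generated elements to reference elements, and averages text similarity with bounding-box overlap.

```python
# Toy reference-based slide score (illustrative only; NOT the metric
# used in SlidesBench). A slide is a list of (text, bbox) elements,
# where bbox = (x0, y0, x1, y1).

from difflib import SequenceMatcher


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


def slide_score(generated, reference):
    """Return a 0-1 score: average best-match quality per reference element."""
    if not reference:
        return 1.0 if not generated else 0.0
    scores, used = [], set()
    for ref_text, ref_box in reference:
        best, best_j = 0.0, None
        for j, (gen_text, gen_box) in enumerate(generated):
            if j in used:
                continue
            text_sim = SequenceMatcher(None, ref_text, gen_text).ratio()
            s = 0.5 * text_sim + 0.5 * iou(ref_box, gen_box)
            if s > best:
                best, best_j = s, j
        if best_j is not None:
            used.add(best_j)  # each generated element matches at most once
        scores.append(best)
    return sum(scores) / len(scores)
```

A perfect reproduction of the reference scores 1.0; missing or misplaced elements lower the average. A real metric would also account for colors, fonts, and rendered appearance.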
Related papers
- Textual-to-Visual Iterative Self-Verification for Slide Generation [46.99825956909532]
We decompose the task of generating missing presentation slides into two key components: content generation and layout generation.
Our approach significantly outperforms baseline methods in terms of alignment, logical flow, visual appeal, and readability.
arXiv Detail & Related papers (2025-02-21T12:21:09Z) - PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides [51.88536367177796]
We propose a two-stage, edit-based approach inspired by human drafts for automatically generating presentations.
PPTAgent first analyzes references to extract slide-level functional types and content schemas, then generates editing actions based on selected reference slides.
PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
arXiv Detail & Related papers (2025-01-07T16:53:01Z) - IDEA-Bench: How Far are Generative Models from Professional Designing? [34.00716542613326]
We introduce IDEA-Bench, a benchmark encompassing 100 real-world design tasks, including rendering, visual effects, storyboarding, picture books, fonts, style-based generation, and identity-preserving generation. Even the best-performing model achieves only 22.48 on IDEA-Bench, while the best general-purpose model achieves only 6.81.
arXiv Detail & Related papers (2024-12-16T13:39:32Z) - PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology [9.556246087301883]
We present a slide-level foundation model for H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings.
PRISM produces slide-level embeddings with the ability to generate clinical reports, resulting in several modes of use.
Using text prompts, PRISM achieves zero-shot cancer detection and sub-typing performance approaching that of a supervised aggregator model.
arXiv Detail & Related papers (2024-05-16T16:59:12Z) - Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models. Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition [49.52436478739151]
Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios.
Recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition.
This paper aims to improve the confidence with view selection and hierarchical prompts.
arXiv Detail & Related papers (2023-11-30T09:51:53Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z) - Classroom Slide Narration System [27.127537034521467]
Slide presentations are an effective and efficient tool used by the teaching community for classroom communication.
A Classroom Slide Narration System (CSNS) generates audio descriptions corresponding to the slide content.
Users rated the output quality of the proposed CSNS more favorably than existing systems such as Facebook's Automatic Alt-Text (AAT) and Tesseract.
arXiv Detail & Related papers (2022-01-21T07:20:03Z) - Improving Label Quality by Jointly Modeling Items and Annotators [68.8204255655161]
We propose a fully Bayesian framework for learning ground truth labels from noisy annotators.
Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic Dawid and Skene joint annotator-data model.
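The Dawid and Skene model referenced above can be sketched in a few lines: estimate each item's true label and each annotator's confusion rates by alternating between soft label estimates (E-step) and per-annotator confusion matrices (M-step). This is a minimal, hypothetical simplification (uniform class prior, fixed iteration count), not the paper's Bayesian extension.

```python
# Minimal Dawid-Skene-style EM sketch (simplified illustration, not the
# paper's full Bayesian framework). votes: dict item -> list of
# (annotator, label) pairs, with labels in 0..n_classes-1.

def dawid_skene(votes, n_classes=2, iters=20):
    items = list(votes)
    annotators = {a for labs in votes.values() for a, _ in labs}
    # Initialize soft labels with per-item majority vote.
    probs = {}
    for it in items:
        counts = [0.0] * n_classes
        for _, lab in votes[it]:
            counts[lab] += 1
        total = sum(counts)
        probs[it] = [c / total for c in counts]
    for _ in range(iters):
        # M-step: per-annotator confusion matrices (with smoothing).
        conf = {a: [[1e-2] * n_classes for _ in range(n_classes)]
                for a in annotators}
        for it in items:
            for a, lab in votes[it]:
                for k in range(n_classes):
                    conf[a][k][lab] += probs[it][k]
        for a in annotators:
            for k in range(n_classes):
                row_sum = sum(conf[a][k])
                conf[a][k] = [v / row_sum for v in conf[a][k]]
        # E-step: posterior over true labels (uniform class prior).
        for it in items:
            post = [1.0] * n_classes
            for a, lab in votes[it]:
                for k in range(n_classes):
                    post[k] *= conf[a][k][lab]
            z = sum(post)
            probs[it] = [p / z for p in post]
    return {it: max(range(n_classes), key=lambda k: probs[it][k])
            for it in items}
```

Annotators who consistently disagree with the consensus end up with skewed confusion matrices, so their votes are automatically down-weighted in the posterior.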
arXiv Detail & Related papers (2021-06-20T02:15:20Z) - DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.