SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations
- URL: http://arxiv.org/abs/2511.09804v1
- Date: Fri, 14 Nov 2025 01:10:17 GMT
- Title: SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations
- Authors: Eric Xie, Danielle Waterfield, Michael Kennedy, Aidong Zhang
- Abstract summary: Large Language Models (LLMs) have shown immense potential in education, automating tasks like quiz generation and content summarization. Existing LLM-based solutions often fail to produce reliable and informative outputs, limiting their educational value. We introduce SlideBot - a modular, multi-agent slide generation framework that integrates LLMs with retrieval, structured planning, and code generation.
- Score: 29.874786844781138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown immense potential in education, automating tasks like quiz generation and content summarization. However, generating effective presentation slides introduces unique challenges due to the complexity of multimodal content creation and the need for precise, domain-specific information. Existing LLM-based solutions often fail to produce reliable and informative outputs, limiting their educational value. To address these limitations, we introduce SlideBot - a modular, multi-agent slide generation framework that integrates LLMs with retrieval, structured planning, and code generation. SlideBot is organized around three pillars: informativeness, ensuring deep and contextually grounded content; reliability, achieved by incorporating external sources through retrieval; and practicality, which enables customization and iterative feedback through instructor collaboration. It incorporates evidence-based instructional design principles from Cognitive Load Theory (CLT) and the Cognitive Theory of Multimedia Learning (CTML), using structured planning to manage intrinsic load and consistent visual macros to reduce extraneous load and enhance dual-channel learning. Within the system, specialized agents collaboratively retrieve information, summarize content, generate figures, and format slides using LaTeX, aligning outputs with instructor preferences through interactive refinement. Evaluations from domain experts and students in AI and biomedical education show that SlideBot consistently enhances conceptual accuracy, clarity, and instructional value. These findings demonstrate SlideBot's potential to streamline slide preparation while ensuring accuracy, relevance, and adaptability in higher education.
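The abstract gives no implementation details, but the pipeline it describes (retrieve, plan, draft, format in LaTeX) can be sketched as a chain of role-specialized agents. The following is a minimal illustration in Python; `call_llm`, the prompts, and the exact agent split are placeholders of ours, not SlideBot's actual design:

```python
from dataclasses import dataclass

# Hypothetical stand-in for an LLM call; SlideBot's real agents, prompts,
# and model choices are not specified in the abstract.
def call_llm(role: str, prompt: str) -> str:
    return f"[{role} output for: {prompt[:40]}...]"

@dataclass
class Slide:
    title: str
    bullets: list

def retrieve(topic: str) -> str:
    # Reliability pillar: ground content in external sources rather than
    # the model's parametric memory alone.
    return call_llm("retriever", f"Collect sourced notes on {topic}")

def plan(topic: str, notes: str) -> list:
    # Structured planning manages intrinsic load (CLT): one concept per slide.
    outline = call_llm("planner", f"Outline slides for {topic} using: {notes}")
    return [line for line in outline.splitlines() if line.strip()]

def draft(outline: list) -> list:
    return [Slide(title=item,
                  bullets=[call_llm("summarizer", f"Bullets for {item}")])
            for item in outline]

def to_beamer(slides: list) -> str:
    # Consistent visual macros reduce extraneous load (CTML); a plain Beamer
    # frame stands in here for SlideBot's actual macro set.
    frames = []
    for s in slides:
        body = "\n".join(rf"\item {b}" for b in s.bullets)
        frames.append(f"\\begin{{frame}}{{{s.title}}}\n"
                      f"\\begin{{itemize}}\n{body}\n\\end{{itemize}}\n"
                      f"\\end{{frame}}")
    return "\n\n".join(frames)

if __name__ == "__main__":
    topic = "Transformers for biomedical text"
    print(to_beamer(draft(plan(topic, retrieve(topic)))))
```

An interactive refinement loop, feeding instructor feedback back into the planner, would sit on top of this chain; the abstract describes such a loop but not its interface.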
Related papers
- SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation [26.4357968329723]
SlideGen is an agentic, modular, visual-in-the-loop framework for scientific paper-to-slide generation. It orchestrates a group of vision-language agents that reason collaboratively over the document's structure and semantics, producing editable LaTeX slides with logical flow and compelling visual presentation.
arXiv Detail & Related papers (2025-12-04T07:22:16Z)
- From Slides to Chatbots: Enhancing Large Language Models with University Course Materials [14.450839675608693]
We investigate how incorporating university course materials can enhance LLM performance in computer science courses. We compare two strategies, Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT), to extend LLMs with course-specific knowledge. Our experiments reveal that, given the relatively small size of university course materials, RAG is more effective and efficient than CPT.
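To make the comparison concrete, here is a minimal sketch of the RAG side: retrieve the most similar course-material chunks and prepend them to the prompt. The TF-IDF scoring and prompt format are illustrative assumptions, not the paper's setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in the paper's setting these would be chunks of
# lecture slides and other university course materials.
chunks = [
    "A hash table maps keys to buckets via a hash function.",
    "Dijkstra's algorithm finds shortest paths with non-negative weights.",
    "TCP provides reliable, ordered delivery over IP.",
]

def retrieve(question: str, k: int = 2) -> list:
    vec = TfidfVectorizer().fit(chunks + [question])
    scores = cosine_similarity(vec.transform([question]),
                               vec.transform(chunks))[0]
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

question = "How does a hash table store values?"
context = "\n".join(retrieve(question))
prompt = f"Answer using the course material below.\n{context}\n\nQ: {question}"
print(prompt)  # this augmented prompt would then be sent to the LLM
```

CPT would instead continue training the model's weights on the same corpus; the paper's finding is that with corpora this small, the retrieval route is both more effective and cheaper.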
arXiv Detail & Related papers (2025-10-25T12:31:26Z)
- Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed, long in duration, and contain far more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets.
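The distillation step can be pictured as aligning the student's video-text similarity distribution with that of the frozen pre-trained teacher. A minimal PyTorch sketch; the temperature and the omission of the paper's dynamic weighting are simplifications of ours:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_sim, teacher_sim, tau=2.0):
    # Soft alignment: match the student's similarity distribution over
    # candidate clips to the teacher's, via KL divergence at temperature tau.
    p_teacher = F.softmax(teacher_sim / tau, dim=-1)
    log_p_student = F.log_softmax(student_sim / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau**2

# Toy example: similarities of one text query to 4 video clips.
student = torch.randn(1, 4, requires_grad=True)
teacher = torch.randn(1, 4)   # from the frozen pre-trained model
loss = distill_loss(student, teacher)
loss.backward()
```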
arXiv Detail & Related papers (2025-10-14T08:38:20Z)
- PreGenie: An Agentic Framework for High-quality Visual Presentation Generation [44.93958820783717]
PreGenie is an agentic and modular framework powered by multimodal large language models (MLLMs) for generating high-quality visual presentations. It operates in two stages: (1) Analysis and Initial Generation, which summarizes multimodal input and generates initial code, and (2) Review and Re-generation, which iteratively reviews intermediate code and rendered slides to produce final, high-quality presentations.
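That two-stage design maps onto a generate-then-critique loop. A schematic version, with the MLLM calls, renderer, and stopping rule stubbed out rather than taken from PreGenie:

```python
def generate_presentation(inputs, mllm, renderer, max_rounds=3):
    # Stage 1: summarize the multimodal input and draft slide code.
    summary = mllm(f"Summarize for slides: {inputs}")
    code = mllm(f"Write slide code for: {summary}")
    for _ in range(max_rounds):
        # Stage 2: render, then let the model review code and slides together.
        slides = renderer(code)
        review = mllm(f"Critique these slides: {slides}\nCode: {code}")
        if "LGTM" in review:   # placeholder stopping criterion
            break
        code = mllm(f"Revise the code. Issues: {review}")
    return code

# Stubs so the sketch runs end to end.
mllm = lambda p: "LGTM" if "Critique" in p else f"[draft from: {p[:30]}]"
renderer = lambda code: f"[rendered {len(code)} chars]"
print(generate_presentation("paper.pdf contents", mllm, renderer))
```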
arXiv Detail & Related papers (2025-05-27T18:36:19Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
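The abstract does not spell out the adaptive mechanism, but retrieval-augmented MLLMs commonly gate retrieved passages by a learned relevance score before fusing them. A generic sketch of such a gate, not RA-BLIP's actual architecture:

```python
import torch
import torch.nn as nn

class RetrievalGate(nn.Module):
    """Scores each retrieved passage against the query embedding and
    keeps only the top-k passages for fusion into the model."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Bilinear(dim, dim, 1)

    def forward(self, query_emb, passage_embs, k=2):
        # query_emb: (dim,); passage_embs: (n, dim)
        q = query_emb.expand(passage_embs.size(0), -1)
        scores = self.scorer(q, passage_embs).squeeze(-1)   # (n,)
        keep = scores.topk(min(k, scores.numel())).indices
        return passage_embs[keep], scores

gate = RetrievalGate(dim=16)
kept, scores = gate(torch.randn(16), torch.randn(5, 16))
print(kept.shape, scores.shape)   # torch.Size([2, 16]) torch.Size([5])
```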
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
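One common way to realize such a visual prompt is to draw the specialist model's output, e.g. a segmentation mask, directly onto the image before the MLLM sees it. A minimal NumPy/PIL version; the blending scheme is an assumption, not necessarily the paper's:

```python
import numpy as np
from PIL import Image

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.4):
    """Blend a binary mask (from a specialist vision model) onto the image
    so the region of interest is visible to a text-prompted MLLM."""
    arr = np.asarray(image.convert("RGB")).astype(float)
    overlay = np.zeros_like(arr)
    overlay[mask.astype(bool)] = color
    blended = np.where(mask[..., None].astype(bool),
                       (1 - alpha) * arr + alpha * overlay, arr)
    return Image.fromarray(blended.astype(np.uint8))

# Toy example: highlight the top-left quadrant of a gray image.
img = Image.new("RGB", (64, 64), (128, 128, 128))
m = np.zeros((64, 64)); m[:32, :32] = 1
overlay_mask(img, m).save("prompted.png")
```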
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
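Generative retrieval frameworks of this kind typically decode a document identifier token by token, constrained so that only identifiers that exist in the knowledge base can be produced. A toy prefix-trie version of that constraint, with the LLM's logits replaced by a stub scorer:

```python
# Valid document identifiers in the (virtual) knowledge base.
doc_ids = ["nature/optics/01", "nature/optics/02", "wiki/lens/aberration"]

def build_trie(ids):
    trie = {}
    for doc in ids:
        node = trie
        for tok in doc.split("/"):
            node = node.setdefault(tok, {})
    return trie

def constrained_decode(score_fn, trie):
    """Greedily pick the highest-scoring token among children that keep
    the partial identifier valid, until a leaf (complete ID) is reached."""
    path, node = [], trie
    while node:
        tok = max(node, key=score_fn)   # score_fn stands in for LLM logits
        path.append(tok)
        node = node[tok]
    return "/".join(path)

trie = build_trie(doc_ids)
# Stub scorer: prefer on-topic tokens, then shorter tokens.
score = lambda tok: ("nature" in tok or "optics" in tok, -len(tok))
print(constrained_decode(score, trie))   # nature/optics/01
```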
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [11.752632557524969]
We propose contrastive learning with data augmentation to disentangle content features from the original representations. Our experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks.
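The core mechanism is a standard contrastive objective in which two augmented prompt views of the same sample are positives and all other batch pairings are negatives. A compact InfoNCE sketch in PyTorch; the encoder and augmentations are left out:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """z1[i] and z2[i] are embeddings of two augmented views of sample i;
    every other pairing in the batch serves as a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                # (n, n) similarity matrix
    targets = torch.arange(z1.size(0))      # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy batch: 8 samples, 32-dim embeddings from two prompt augmentations.
z1 = torch.randn(8, 32, requires_grad=True)
z2 = torch.randn(8, 32)
loss = info_nce(z1, z2)
loss.backward()
print(loss.item())
```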
arXiv Detail & Related papers (2023-11-28T03:00:59Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive zero-shot learning (ZSL) framework that matches regions of images with their corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
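At its core, matching regions to semantics reduces to scoring region features against class embeddings in a shared space, which is what lets embeddings of unseen classes participate at test time. A bare-bones version of that scoring step, with illustrative dimensions and a random stand-in for the learned projection:

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats, class_embs, proj):
    """Project region features into the semantic space and score every
    region against every class embedding; unseen classes need only an
    embedding (e.g. from text), not training images."""
    projected = F.normalize(region_feats @ proj, dim=-1)   # (r, d_sem)
    classes = F.normalize(class_embs, dim=-1)              # (c, d_sem)
    return projected @ classes.T                           # (r, c) scores

regions = torch.randn(5, 128)   # 5 detected region features
classes = torch.randn(10, 64)   # 10 class embeddings, incl. unseen classes
proj = torch.randn(128, 64)     # learned projection (random here)
scores = classify_regions(regions, classes, proj)
print(scores.argmax(dim=1))     # best class per region
```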
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language covering 180+ hours of video and 9,000+ slides from 10 lecturers across various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
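A multi-instance learning loss treats the set of utterances spoken over one slide as a bag: the slide only has to match its best instance, not every one. A simplified max-over-bag contrastive sketch in PyTorch; the bag construction and similarity model are assumptions of ours, not PolyViLT's:

```python
import torch
import torch.nn.functional as F

def mil_nce(slide_embs, utt_embs, bag_ids, tau=0.07):
    """slide_embs: (s, d); utt_embs: (u, d); bag_ids[j] = index of the
    slide that utterance j was spoken over. Each slide is scored by its
    best-matching utterance within the bag (max over instances)."""
    sims = F.normalize(slide_embs, dim=-1) @ F.normalize(utt_embs, dim=-1).T
    sims = sims / tau                                  # (s, u)
    loss = 0.0
    for i in range(slide_embs.size(0)):
        pos = sims[i, bag_ids == i].max()   # best instance in the bag
        loss = loss + (torch.logsumexp(sims[i], dim=0) - pos)
    return loss / slide_embs.size(0)

slides = torch.randn(3, 16, requires_grad=True)
utts = torch.randn(7, 16)
bags = torch.tensor([0, 0, 1, 1, 1, 2, 2])   # utterance -> slide
mil_nce(slides, utts, bags).backward()
```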
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.