Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach
- URL: http://arxiv.org/abs/2406.06556v1
- Date: Sat, 1 Jun 2024 07:49:31 GMT
- Title: Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach
- Authors: Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, Apoorv Saxena
- Abstract summary: Existing approaches for generating a rich presentation from a document are often semi-automatic or simply place a flat summary on the slides, ignoring the importance of a good narrative.
We propose a multi-staged end-to-end model that uses a combination of an LLM and a VLM.
Our experiments show that, compared to applying LLMs directly with state-of-the-art prompting, the proposed multi-staged solution performs better on both automated metrics and human evaluation.
- Score: 21.8104104944488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating presentation slides from a long document with multimodal elements such as text and images is an important task. Done manually, it is time-consuming and requires domain expertise. Existing approaches for generating a rich presentation from a document are often semi-automatic or simply place a flat summary on the slides, ignoring the importance of a good narrative. In this paper, we address this research gap by proposing a multi-staged end-to-end model that uses a combination of an LLM and a VLM. Our experiments show that, compared to applying LLMs directly with state-of-the-art prompting, the proposed multi-staged solution performs better on both automated metrics and human evaluation.
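As a rough illustration of what a multi-staged pipeline of this kind can look like, the sketch below splits the job into outline generation, per-section bullet generation, and VLM-based image selection. The stage decomposition and the call_llm/call_vlm helpers are illustrative assumptions, not the authors' actual design.

```python
# Illustrative sketch of a multi-staged document-to-slides pipeline.
# The stage split and both helper functions are assumptions for
# illustration; the paper's actual stages and prompts may differ.

def call_llm(prompt: str) -> str:
    """Stand-in for any text-only LLM endpoint."""
    raise NotImplementedError

def call_vlm(prompt: str, image_path: str) -> float:
    """Stand-in for a VLM that returns an image-relevance score."""
    raise NotImplementedError

def generate_slides(document: str, image_paths: list[str]) -> list[dict]:
    # Stage 1: derive a narrative outline rather than a flat summary.
    outline = call_llm(
        "Propose a slide-by-slide outline with a clear narrative arc "
        f"for this document:\n{document}")
    slides = []
    for section in filter(None, (s.strip() for s in outline.split("\n"))):
        # Stage 2: generate concise, grounded bullets per outline section.
        bullets = call_llm(
            f"Write 3-4 slide bullets for '{section}', grounded in:\n{document}")
        # Stage 3: let the VLM pick the best-matching image for the section.
        best = max(image_paths, default=None,
                   key=lambda p: call_vlm(f"How well does this image fit "
                                          f"'{section}'?", p))
        slides.append({"title": section, "bullets": bullets, "image": best})
    return slides
```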
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different levels of granularity.
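As a toy illustration of multi-granular visual-token compression, the sketch below average-pools one token sequence to several coarser levels so a retriever can trade accuracy for cost; this pooling rule is a guess at the general Matryoshka idea, not the MME architecture.

```python
# Toy sketch of Matryoshka-style visual-token compression: the same token
# sequence is pooled to several coarser granularities. Illustrative only.
import torch
import torch.nn.functional as F

def compress_tokens(visual_tokens: torch.Tensor, factors=(1, 4, 16)) -> dict:
    """visual_tokens: (num_tokens, dim) -> {factor: (num_tokens // factor, dim)}."""
    levels = {}
    for f in factors:
        x = visual_tokens.T.unsqueeze(0)          # avg_pool1d wants (batch, dim, len)
        pooled = F.avg_pool1d(x, kernel_size=f)   # (1, dim, num_tokens // f)
        levels[f] = pooled.squeeze(0).T           # (num_tokens // f, dim)
    return levels

tokens = torch.randn(64, 768)                     # e.g., 64 patch tokens
for f, t in compress_tokens(tokens).items():
    print(f"factor {f}: {tuple(t.shape)}")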
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding [41.43688559565315]
We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs).
Our approach employs multi-scale visual features to handle various font sizes within document images.
We introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text.
arXiv Detail & Related papers (2024-11-08T00:58:12Z)
- LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding [103.69014172427026]
Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents.
We present a novel framework named LoRA-Contextualizing Adaptation of Large Multimodal Models (LoCAL), which broadens the capabilities of any LMM to support long-document understanding.
arXiv Detail & Related papers (2024-11-02T02:09:01Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding, which requires processing and comprehending large amounts of textual and visual information, is a challenging task.
Recent advances in Large Language Models (LLMs) have significantly improved performance on this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question answering (QA) for long PDF documents.
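A minimal sketch of the sparse-sampling idea, assuming it amounts to scoring document chunks against the question and keeping only the top-k before the model sees them; the TF-IDF scorer below is a stand-in, not PDF-WuKong's learned end-to-end sampler.

```python
# Minimal sketch of sparse sampling for long-PDF QA: score every chunk
# against the question and keep only the top-k before prompting the model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sparse_sample(chunks: list[str], question: str, k: int = 3) -> list[str]:
    vec = TfidfVectorizer().fit(chunks + [question])
    chunk_mat, q_vec = vec.transform(chunks), vec.transform([question])
    scores = cosine_similarity(chunk_mat, q_vec).ravel()
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top)]  # keep original document order

pages = ["Introduction to the dataset ...", "Table 3 reports accuracy ...",
         "Related work on OCR ...", "Accuracy drops at 100 pages ..."]
print(sparse_sample(pages, "How does accuracy change with document length?"))
```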
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
However, their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models [9.232693392690702]
TextHawk is a document-oriented Multimodal Large Language Model (MLLM).
It explores efficient fine-grained perception through four dedicated components.
We conduct extensive experiments on both general and document-oriented MLLM benchmarks and show that TextHawk outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-04-14T09:48:37Z)
- Meta-Task Prompting Elicits Embeddings from Large Language Models [54.757445048329735]
We introduce a new unsupervised text embedding method, Meta-Task Prompting with Explicit One-Word Limitation.
We generate high-quality sentence embeddings from Large Language Models without the need for model fine-tuning.
Our findings suggest a new scaling law, offering a versatile and resource-efficient approach for embedding generation across diverse scenarios.
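A minimal sketch of the one-word-limitation idea, assuming the common recipe of prompting for a single word and reading off the hidden state of the last prompt token; the prompt template, pooling choice, and the small stand-in model are assumptions, not the paper's verified setup.

```python
# Hedged sketch: elicit a sentence embedding from a causal LM by asking it
# to compress the sentence's meaning into one word, then use the final
# hidden state of the last prompt token as the embedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper targets larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def embed(sentence: str) -> torch.Tensor:
    prompt = f'This sentence: "{sentence}" means in one word: "'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Final layer, last prompt token: where the one-word answer is "about to"
    # be produced, so its representation summarizes the sentence.
    return out.hidden_states[-1][0, -1]

e1, e2 = embed("A cat sits on the mat."), embed("A feline rests on a rug.")
print(torch.nn.functional.cosine_similarity(e1, e2, dim=0).item())
```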
arXiv Detail & Related papers (2024-02-28T16:35:52Z)
- Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task.
We report well-established metrics such as ROUGE, BERTScore, and SEM-F1 on two different datasets of alternative narratives.
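For reference, ROUGE and BERTScore can be computed with standard packages (pip install rouge-score bert-score); SEM-F1 has its own reference implementation and is omitted here. The texts below are invented examples.

```python
# Computing two of the reported metrics with standard packages.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Both narratives agree the flood damaged the old bridge."
candidate = "The two stories both say the flood destroyed the old bridge."

# ROUGE: n-gram and longest-common-subsequence overlap.
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(rouge.score(reference, candidate))

# BERTScore: contextual-embedding similarity between candidate and reference.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```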
arXiv Detail & Related papers (2024-02-26T20:33:50Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending contexts that involve multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
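A rough sketch of the two-phase flow, assuming "browse" means eliciting per-image notes and "concentrate" means fusing those notes with the full image set; the mllm helper is a hypothetical stand-in, and the prompts are not the paper's templates.

```python
# Rough sketch of a browse-and-concentrate style two-phase prompt flow.
# `mllm` is a hypothetical stand-in for any multimodal chat endpoint.

def mllm(prompt: str, images: list[str]) -> str:
    """Stand-in for a multimodal LLM call taking text plus image paths."""
    raise NotImplementedError

def browse_and_concentrate(images: list[str], question: str) -> str:
    # Phase 1 (browse): look at each image on its own and take notes.
    notes = [mllm("Briefly describe what is relevant to the question: "
                  + question, [img]) for img in images]
    # Phase 2 (concentrate): fuse the per-image notes with all images.
    context = "\n".join(f"Image {i + 1}: {n}" for i, n in enumerate(notes))
    return mllm(f"Notes from browsing:\n{context}\n\nNow answer: {question}",
                images)
```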
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- LAPDoc: Layout-Aware Prompting for Documents [3.523208537466128]
We investigate the possibility of using purely text-based LLMs for document-specific tasks through layout enrichment.
Our results indicate that layout enrichment can improve the performance of purely text-based LLMs for document understanding by up to 15%.
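One simple layout-enrichment verbalizer renders OCR tokens into a text grid that roughly preserves page geometry before prompting a text-only LLM; LAPDoc compares several such strategies, and this particular rendering rule is only an illustrative assumption.

```python
# Hedged illustration of layout enrichment: place OCR tokens on a character
# grid that roughly preserves page geometry, then prompt a text-only LLM.
def render_layout(tokens, cols=80, rows=20, page_w=1000, page_h=1000):
    grid = [[" "] * cols for _ in range(rows)]
    for text, x, y in tokens:  # (string, top-left x, top-left y) in page units
        r = min(rows - 1, y * rows // page_h)
        c = min(cols - 1, x * cols // page_w)
        for i, ch in enumerate(text):
            if c + i < cols:
                grid[r][c + i] = ch
    return "\n".join("".join(row).rstrip() for row in grid)

ocr = [("Invoice", 400, 30), ("Date:", 60, 120), ("2024-02-15", 220, 120),
       ("Total:", 60, 900), ("$1,234.00", 220, 900)]
prompt = ("Answer using the page below.\n" + render_layout(ocr)
          + "\nQ: What is the total?")
print(prompt)
```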
arXiv Detail & Related papers (2024-02-15T10:00:49Z)