PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
- URL: http://arxiv.org/abs/2502.17540v1
- Date: Mon, 24 Feb 2025 18:35:39 GMT
- Title: PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
- Authors: Rohit Saxena, Pasquale Minervini, Frank Keller
- Abstract summary: PosterSum is a novel benchmark to advance the development of vision-language models. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics.
- Score: 19.416714365519713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics, achieving a 3.14% gain in ROUGE-L. PosterSum will serve as a starting point for future research on poster summarization.
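The abstract describes Segment & Summarize only at a high level, so the Python sketch below is a hedged reconstruction of a hierarchical segment-then-summarize pipeline rather than the authors' implementation: the uniform grid segmentation, the `summarize_region` stand-in, and the concatenation-based merge are all illustrative assumptions.

```python
from PIL import Image


def segment_poster(image_path: str, rows: int = 3, cols: int = 2) -> list[Image.Image]:
    """Split a poster into a uniform grid of regions (a simple stand-in
    for the paper's segmentation step)."""
    poster = Image.open(image_path)
    width, height = poster.size
    tile_w, tile_h = width // cols, height // rows
    return [
        poster.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(rows)
        for c in range(cols)
    ]


def summarize_region(region: Image.Image) -> str:
    """Placeholder for an MLLM call that describes one poster region."""
    raise NotImplementedError("plug in a vision-language model here")


def segment_and_summarize(image_path: str) -> str:
    """Hierarchical summary: summarize regions locally, then merge globally."""
    local = [summarize_region(region) for region in segment_poster(image_path)]
    # A real second stage would prompt a text-only LLM to fuse the local
    # summaries into one abstract; plain concatenation stands in here.
    return " ".join(local)
```

Evaluating the resulting summary against the paired abstract could then use an off-the-shelf ROUGE-L implementation such as the rouge-score package.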
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations [47.79536652721794]
This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts.
arXiv Detail & Related papers (2025-02-12T10:36:55Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z)
- Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System [47.13932723910289]
We introduce an open-source multi-modal automated academic paper interpretation system (MMAPIS) with a three-stage pipeline (a minimal sketch of its section-alignment step appears at the end of this page).
It employs a hybrid modality preprocessing and alignment module to extract plain text, tables, and figures from documents separately.
It then aligns this information based on the section names they belong to, ensuring that data with identical section names are grouped under the same section.
It uses the extracted section names to divide the article into shorter text segments, enabling targeted summarization both within and between sections via LLMs.
arXiv Detail & Related papers (2024-01-17T11:50:53Z)
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model [73.38800189095173]
This work focuses on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.
By parsing the LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset, M-Paper.
M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the form of images or LaTeX code.
arXiv Detail & Related papers (2023-11-30T04:43:26Z)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [111.65584066987036]
InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition.
It can effortlessly generate coherent and contextual articles that seamlessly integrate images.
It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates.
arXiv Detail & Related papers (2023-09-26T17:58:20Z)
- Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z)
- Neural Content Extraction for Poster Generation of Scientific Papers [84.30128728027375]
The problem of poster generation for scientific papers is under-investigated.
Previous studies focus mainly on poster layout and panel composition, while neglecting the importance of content extraction.
To obtain both the textual and visual elements of a poster panel, a neural extractive model is proposed that extracts the text, figures, and tables of a paper section simultaneously.
arXiv Detail & Related papers (2021-12-16T01:19:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
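As a closing illustration of the section-alignment idea referenced in the MMAPIS entry above, the sketch below groups extracted elements by the section name they came from and summarizes each section separately. The `Element` type, the sample data, and the `summarize_section` stub are illustrative assumptions, not the MMAPIS implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Element:
    section: str   # section name the element was extracted from
    kind: str      # "text", "table", or "figure"
    content: str   # plain text, or a caption/reference for non-text items


def align_by_section(elements: list[Element]) -> dict[str, list[Element]]:
    """Group elements so that items sharing a section name stay together."""
    sections: dict[str, list[Element]] = defaultdict(list)
    for element in elements:
        sections[element.section].append(element)
    return dict(sections)


def summarize_section(name: str, elements: list[Element]) -> str:
    """Placeholder for an LLM call over one section's aligned elements."""
    body = "\n".join(e.content for e in elements)
    # A real system would prompt an LLM with `body`; a stub stands in here.
    return f"[summary of '{name}' ({len(body.split())} source words)]"


# Per-section summaries, which a final LLM pass could fuse across sections.
paper = [
    Element("Introduction", "text", "Posters condense papers visually."),
    Element("Results", "table", "Table 1: ROUGE scores per model."),
    Element("Results", "text", "The hierarchical method improves ROUGE-L."),
]
summaries = {name: summarize_section(name, elems)
             for name, elems in align_by_section(paper).items()}
```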