Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
- URL: http://arxiv.org/abs/2501.19353v3
- Date: Tue, 18 Feb 2025 18:07:34 GMT
- Title: Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
- Authors: Ting-Yao E. Hsu, Yi-Li Hsu, Shaurya Rohatgi, Chieh-Yang Huang, Ho Yin Sam Ng, Ryan Rossi, Sungchul Kim, Tong Yu, Lun-Wei Ku, C. Lee Giles, Ting-Hao K. Huang
- Abstract summary: In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields.
This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the field's state.
We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors.
- Score: 33.089795292870186
- License:
- Abstract: Since the SciCap dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?
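To make the task concrete, below is a minimal sketch of how a GPT-4V-class model can be prompted to caption a scientific figure. It assumes the OpenAI Python client (openai>=1.0); the model name, prompt wording, and file path are illustrative placeholders, not the challenge's actual evaluation setup.

```python
# Minimal sketch: asking a GPT-4V-class model to caption a scientific figure.
# Assumes the OpenAI Python client (openai>=1.0); the model name, prompt
# wording, and file path are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the figure image so it can be sent inline as a data URL.
with open("figure1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any GPT-4V-class multimodal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Write a one-sentence caption for this figure from a "
                     "scholarly paper. State what is plotted and the key takeaway."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=100,
)
print(response.choices[0].message.content)
```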
Related papers
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions [118.35194230865451]
We introduce BLIP3-KALE, a dataset of 218 million image-text pairs.
KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions.
We train vision-language models on KALE and demonstrate improvements on vision-language tasks.
arXiv Detail & Related papers (2024-11-12T00:52:52Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B on our task-specific data yielded better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings [28.973082312034343]
This paper introduces SciCapenter, an interactive system that brings together cutting-edge AI technologies to support scientific figure caption writing.
SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality.
A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing.
arXiv Detail & Related papers (2024-03-26T15:16:14Z) - CapsFusion: Rethinking Image-Text Data at Scale [32.334143749598766]
We propose CapsFusion to consolidate and refine information from both web-based image-text pairs and synthetic captions.
Experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance.
arXiv Detail & Related papers (2023-10-31T15:31:39Z) - SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning [18.94446071846939]
Figure caption generation helps move models' understanding of scientific documents beyond text.
We extend the large-scale SciCap dataset to include mention-paragraphs (paragraphs mentioning figures) and OCR tokens.
Our results indicate that mention-paragraphs serve as additional context knowledge, significantly boosting standard automatic image-captioning evaluation scores.
arXiv Detail & Related papers (2023-06-06T08:16:16Z) - Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization [31.619379039184263]
Figure caption generation for scientific documents can be more effectively tackled as a text summarization task.
We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs (see the sketch after this list).
Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations.
arXiv Detail & Related papers (2023-02-23T20:39:06Z) - Using Large Language Models to Generate Engaging Captions for Data Visualizations [51.98253121636079]
Large language models (LLMs) use sophisticated deep learning techniques to produce human-like prose.
A key challenge lies in designing the most effective prompt for the LLM, a task called prompt engineering.
We report on first experiments with the popular LLM GPT-3 and present some promising results.
arXiv Detail & Related papers (2022-12-27T23:56:57Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and scaled-up language models, do not consistently improve self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - SciCap: Generating Captions for Scientific Figures [20.696070723932866]
We introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020.
After pre-processing, SCICAP contained more than two million figures extracted from over 290,000 papers.
We established baseline models that caption graph plots, the dominant (19.2%) figure type.
arXiv Detail & Related papers (2021-10-22T07:10:41Z)
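The summarization-as-captioning recipe from the Summaries as Captions entry above can be tried with off-the-shelf components. A minimal sketch follows, assuming the Hugging Face transformers library and the public google/pegasus-arxiv checkpoint; note that the original work fine-tuned PEGASUS specifically on figure-referencing paragraphs, which this generic checkpoint does not reproduce.

```python
# Minimal sketch of the summaries-as-captions idea: summarize the paragraph
# that references a figure and treat the summary as a candidate caption.
# Assumes Hugging Face transformers and the public google/pegasus-arxiv
# checkpoint; the paper fine-tuned PEGASUS on figure-referencing paragraphs,
# which this off-the-shelf checkpoint does not replicate.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/pegasus-arxiv")

# A figure-referencing paragraph pulled from a paper's body text (illustrative).
mention_paragraph = (
    "As shown in Figure 3, the validation loss of the fine-tuned model drops "
    "sharply within the first 10 epochs and then plateaus, while the baseline "
    "continues to decrease slowly and never reaches the same level."
)

candidate_caption = summarizer(
    mention_paragraph, max_length=40, min_length=10, do_sample=False
)[0]["summary_text"]
print(candidate_caption)
```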