SciCap+: A Knowledge Augmented Dataset to Study the Challenges of
Scientific Figure Captioning
- URL: http://arxiv.org/abs/2306.03491v1
- Date: Tue, 6 Jun 2023 08:16:16 GMT
- Title: SciCap+: A Knowledge Augmented Dataset to Study the Challenges of
Scientific Figure Captioning
- Authors: Zhishen Yang, Raj Dabre, Hideki Tanaka, Naoaki Okazaki
- Abstract summary: Figure caption generation helps move model understanding of scientific documents beyond text.
We extend the large-scale SciCap dataset to include mention-paragraphs (paragraphs mentioning figures) and OCR tokens.
Our results indicate that mention-paragraphs serve as additional context knowledge, which significantly boosts standard automatic image-captioning evaluation scores.
- Score: 18.94446071846939
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In scholarly documents, figures provide a straightforward way of
communicating scientific findings to readers. Automating figure caption
generation helps move model understanding of scientific documents beyond text
and will help authors write informative captions that facilitate communicating
scientific findings. Unlike previous studies, we reframe scientific figure
captioning as a knowledge-augmented image captioning task that models need to
utilize knowledge embedded across modalities for caption generation. To this
end, we extend the large-scale SciCap dataset (Hsu et al., 2021) to SciCap+, which
includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. We
then conduct experiments with the M4C-Captioner (a multimodal transformer-based
model with a pointer network) as a baseline for our study. Our results indicate
that mention-paragraphs serve as additional context knowledge, which
significantly boosts standard automatic image-captioning evaluation scores
compared to the figure-only baselines. Human evaluations further reveal the
challenges of generating figure captions that are informative to readers. The
code and SciCap+ dataset will be publicly available at
https://github.com/ZhishenYang/scientific_figure_captioning_dataset
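For concreteness, a minimal sketch of what a single knowledge-augmented example and its textual model input might look like is shown below. The field names (`figure_path`, `mention_paragraph`, `ocr_tokens`, `caption`) and the BERT tokenizer are illustrative assumptions, not the released SciCap+ schema or the exact M4C-Captioner preprocessing.

```python
# Hedged sketch of a knowledge-augmented captioning example in the spirit of
# SciCap+. Field names and tokenizer are assumptions for illustration only;
# consult the released dataset for the actual schema.
from transformers import AutoTokenizer

example = {
    "figure_path": "figures/1234.5678v1-Figure3-1.png",       # rendered figure image
    "mention_paragraph": (
        "Figure 3 shows that the proposed method converges faster than the "
        "baseline on all three benchmarks."
    ),                                                          # paragraph mentioning the figure
    "ocr_tokens": ["Epochs", "Accuracy", "baseline", "ours"],   # text detected inside the figure
    "caption": "Figure 3: Validation accuracy versus training epochs.",
}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def build_text_input(ex, max_len=512):
    # Concatenate the two textual knowledge sources: the mention-paragraph as
    # context and the OCR tokens as extra, copyable vocabulary.
    context = ex["mention_paragraph"] + " " + " ".join(ex["ocr_tokens"])
    return tokenizer(context, truncation=True, max_length=max_len, return_tensors="pt")

inputs = build_text_input(example)
print(inputs["input_ids"].shape)
```

In an M4C-Captioner-style setup, the OCR tokens would additionally feed the pointer network so the decoder can copy figure text verbatim into the generated caption.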
Related papers
- Figuring out Figures: Using Textual References to Caption Scientific Figures [3.358364892753541]
Previous work in automatically generating figure captions has been largely unsuccessful and has defaulted to using single-layer LSTMs.
In our work, we use the SciCap datasets curated by Hsu et al. and a variant of a CLIP+GPT-2 encoder-decoder model with cross-attention to generate captions conditioned on the image.
arXiv Detail & Related papers (2024-06-25T21:49:21Z)
- Text-only Synthesis for Image Captioning [26.774411180980994]
We propose Text-only Synthesis for Image Captioning (ToCa)
We deconstruct caption text into structures and lexical words.
Massive captions that contain various patterns of lexical words are generated.
arXiv Detail & Related papers (2024-05-28T15:11:17Z)
- SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings [28.973082312034343]
This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific figure captions.
SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality.
A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing.
arXiv Detail & Related papers (2024-03-26T15:16:14Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization [31.619379039184263]
Figure caption generation can be more effectively tackled as a text summarization task in scientific documents.
We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs; a hedged sketch of this idea appears after this list.
Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations.
arXiv Detail & Related papers (2023-02-23T20:39:06Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge numbers of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- SciCap: Generating Captions for Scientific Figures [20.696070723932866]
We introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020.
After pre-processing, SCICAP contained more than two million figures extracted from over 290,000 papers.
We established baseline models that caption graph plots, the dominant (19.2%) figure type.
arXiv Detail & Related papers (2021-10-22T07:10:41Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understanding our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
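As flagged in the "Summaries as Captions" entry above, one related line of work treats figure captioning as summarization of the figure-referencing (mention) paragraphs. A minimal sketch of that idea with an off-the-shelf PEGASUS checkpoint follows; the checkpoint name, input text, and generation settings are illustrative assumptions rather than that paper's configuration.

```python
# Hedged sketch: generate a figure caption by summarizing the paragraph that
# mentions the figure, in the spirit of "Summaries as Captions". The checkpoint
# and generation settings are illustrative assumptions, not the paper's setup.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-large"  # assumption: any abstractive summarizer could stand in here
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

mention_paragraph = (
    "As shown in Figure 3, the proposed method converges faster than the "
    "baseline and reaches 92% validation accuracy after 20 epochs."
)

batch = tokenizer(mention_paragraph, truncation=True, max_length=512, return_tensors="pt")
summary_ids = model.generate(**batch, num_beams=4, max_length=64)
caption = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(caption)  # a candidate caption distilled from the mention-paragraph
```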