MedICaT: A Dataset of Medical Images, Captions, and Textual References
- URL: http://arxiv.org/abs/2010.06000v1
- Date: Mon, 12 Oct 2020 19:56:08 GMT
- Title: MedICaT: A Dataset of Medical Images, Captions, and Textual References
- Authors: Sanjay Subramanian, Lucy Lu Wang, Sachin Mehta, Ben Bogin, Madeleine
van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi
- Abstract summary: Previous work focused on classifying figure content rather than understanding how images relate to the text.
MedICaT consists of 217K images from 131K open access biomedical papers.
Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures.
- Score: 71.3960667004975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the relationship between figures and text is key to scientific
document understanding. Medical figures in particular are quite complex, often
consisting of several subfigures (75% of figures in our dataset), with detailed
text describing their content. Previous work studying figures in scientific
papers focused on classifying figure content rather than understanding how
images relate to the text. To address challenges in figure retrieval and
figure-to-text alignment, we introduce MedICaT, a dataset of medical images in
context. MedICaT consists of 217K images from 131K open access biomedical
papers, and includes captions, inline references for 74% of figures, and
manually annotated subfigures and subcaptions for a subset of figures. Using
MedICaT, we introduce the task of subfigure to subcaption alignment in compound
figures and demonstrate the utility of inline references in image-text
matching. Our data and code can be accessed at
https://github.com/allenai/medicat.
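To make the dataset structure and the subfigure-to-subcaption alignment task concrete, here is a minimal Python sketch. The JSON-lines field names are illustrative assumptions rather than the actual MedICaT schema (consult the GitHub repository for the real format), and framing alignment as Hungarian matching over a similarity matrix is just one way to pose the task, not the method evaluated in the paper.

```python
import json

import numpy as np
from scipy.optimize import linear_sum_assignment


def load_records(jsonl_path):
    """Load figure records from a JSON-lines file.

    Assumes one JSON object per line with fields such as "caption",
    "references", and "subcaptions"; these names are illustrative,
    not the actual MedICaT schema.
    """
    with open(jsonl_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def align_subfigures(subfigure_embs, subcaption_embs):
    """Assign subfigures to subcaptions by maximizing total cosine similarity.

    subfigure_embs: (n, d) array of subfigure embeddings from any image encoder.
    subcaption_embs: (m, d) array of subcaption embeddings from any text encoder.
    Returns a list of (subfigure_index, subcaption_index) pairs.
    """
    a = subfigure_embs / np.linalg.norm(subfigure_embs, axis=1, keepdims=True)
    b = subcaption_embs / np.linalg.norm(subcaption_embs, axis=1, keepdims=True)
    similarity = a @ b.T  # (n, m) cosine similarities
    rows, cols = linear_sum_assignment(-similarity)  # negate to maximize
    return list(zip(rows.tolist(), cols.tolist()))


if __name__ == "__main__":
    # Toy example: random vectors stand in for real subfigure/subcaption encoders.
    rng = np.random.default_rng(0)
    figs = rng.normal(size=(3, 16))
    caps = rng.normal(size=(3, 16))
    print(align_subfigures(figs, caps))
```

With real encoders in place of the random vectors, the same assignment step gives a one-to-one subfigure/subcaption pairing for a compound figure.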
Related papers
- DOCCI: Descriptions of Connected and Contrasting Images [58.377060316967864]
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
- MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning [20.33625985769796]
Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs; a generic contrastive objective of this kind is sketched at the end of this list.
We propose a Medical Language-Image Pre-training framework, which exploits the limited image-text medical data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
arXiv Detail & Related papers (2024-01-03T07:54:13Z)
- Understanding Social Media Cross-Modality Discourse in Linguistic Space [26.19949919969774]
We present a novel concept of cross-modality discourse, reflecting how human readers couple image and text understandings.
We build the very first dataset containing 16K multimedia tweets with manually annotated discourse labels.
arXiv Detail & Related papers (2023-02-26T13:04:04Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Using Text to Teach Image Retrieval [47.72498265721957]
We build on the concept of image manifold to represent the feature space of images, learned via neural networks, as a graph.
We augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images.
The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval.
arXiv Detail & Related papers (2020-11-19T16:09:14Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
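The MedICaT abstract uses inline references for image-text matching, and the MLIP entry above describes contrastive language-image pre-training. As noted in that entry, the sketch below shows a generic CLIP-style symmetric contrastive objective over a batch of paired image and text embeddings; it is a NumPy forward-pass illustration only, not the training objective of MedICaT or MLIP.

```python
import numpy as np


def contrastive_image_text_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired image/text embeddings.

    image_embs, text_embs: (batch, dim) arrays where row i of each forms a
    matched pair. Forward pass only; no gradients are computed here.
    """
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature     # (batch, batch) similarity scores
    labels = np.arange(len(logits))      # matched pairs sit on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imgs = rng.normal(size=(4, 32))
    txts = rng.normal(size=(4, 32))
    print(contrastive_image_text_loss(imgs, txts))
```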