Kvasir-VQA: A Text-Image Pair GI Tract Dataset
- URL: http://arxiv.org/abs/2409.01437v1
- Date: Mon, 2 Sep 2024 19:41:59 GMT
- Title: Kvasir-VQA: A Text-Image Pair GI Tract Dataset
- Authors: Sushant Gautam, Andrea Storås, Cise Midoglu, Steven A. Hicks, Vajira Thambawita, Pål Halvorsen, Michael A. Riegler
- Abstract summary: This dataset comprises 6,500 annotated images spanning various GI tract conditions and surgical instruments.
The dataset is intended for applications such as image captioning, Visual Question Answering (VQA), text-based generation of synthetic medical images, object detection, and classification.
- Score: 4.250633109741797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Kvasir-VQA, an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations to facilitate advanced machine learning tasks in Gastrointestinal (GI) diagnostics. This dataset comprises 6,500 annotated images spanning various GI tract conditions and surgical instruments, and it supports multiple question types including yes/no, choice, location, and numerical count. The dataset is intended for applications such as image captioning, Visual Question Answering (VQA), text-based generation of synthetic medical images, object detection, and classification. Our experiments demonstrate the dataset's effectiveness in training models for three selected tasks, showcasing significant applications in medical image analysis and diagnostics. We also present evaluation metrics for each task, highlighting the usability and versatility of our dataset. The dataset and supporting artifacts are available at https://datasets.simula.no/kvasir-vqa.
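As an illustration of how such image-question-answer annotations might be consumed, here is a minimal sketch assuming the annotations are exported as a CSV with columns named question, answer, and img_id; these file and column names are assumptions for the example, not the dataset's documented schema (see https://datasets.simula.no/kvasir-vqa for the actual layout).

```python
# Minimal sketch: load hypothetical Kvasir-VQA annotations and group them.
# File name and column names are assumptions, not the documented schema;
# see https://datasets.simula.no/kvasir-vqa for the actual artifacts.
import pandas as pd

df = pd.read_csv("kvasir-vqa/metadata.csv")  # one row per image-question-answer triple

# Rough overview of the question distribution (yes/no, choice, location, count, ...).
print(df["question"].value_counts().head(10))

def qa_pairs_for_image(frame: pd.DataFrame, img_id: str):
    """Collect all (question, answer) pairs for one image, e.g. to build VQA samples."""
    rows = frame[frame["img_id"] == img_id]
    return list(zip(rows["question"], rows["answer"]))
```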
Related papers
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data to better align medical visual and textual features.
Downstream tasks of image classification and image-text retrieval are conducted on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering [8.547600133510551]
This paper develops a Benchmark Evaluation SysTem for Medical Visual Question Answering, denoted by BESTMVQA.
Our system provides a useful tool for users to automatically build Med-VQA datasets, which helps overcome the problem of insufficient data.
With simple configurations, our system automatically trains and evaluates the selected models over a benchmark dataset.
arXiv Detail & Related papers (2023-12-13T03:08:48Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [45.38823400370285]
Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA).
In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA.
arXiv Detail & Related papers (2023-03-01T14:21:19Z)
- Medical visual question answering using joint self-supervised learning [8.817054025763325]
The encoder embeds the image-text dual modalities with a self-attention mechanism.
The decoder is connected to the top of the encoder and fine-tuned using the small-sized medical VQA dataset.
arXiv Detail & Related papers (2023-02-25T12:12:22Z)
- Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs [55.78588835407174]
Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images.
ViTs do not rely on convolutions but on patch-based self-attention, and, in contrast to CNNs, no prior knowledge of local connectivity is present.
Our results show that ViTs and CNNs perform on par, with a small benefit for ViTs, while DeiTs outperform the former if a reasonably large data set is available for training.
arXiv Detail & Related papers (2022-08-17T09:07:45Z)
- SimVQA: Exploring Simulated Environments for Visual Question Answering [15.030013924109118]
We explore using synthetic computer-generated data to fully control the visual and language space.
We quantify the effect of synthetic data on real-world VQA benchmarks and the extent to which it produces results that generalize to real data.
We propose Feature Swapping (F-SWAP), where we randomly switch object-level features during training to make a VQA model more domain-invariant.
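As a rough sketch of the idea (one plausible reading, not necessarily the paper's exact procedure), the snippet below randomly exchanges object-level feature vectors between examples in a batch; the tensor shape and swap probability are illustrative assumptions.

```python
# Sketch of swapping object-level features between examples in a batch (assumed shapes).
import torch

def feature_swap(obj_feats: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """obj_feats: (batch, num_objects, feat_dim) object-level features."""
    batch, num_obj, _ = obj_feats.shape
    swapped = obj_feats.clone()
    # Pair every example with a random partner from the same batch.
    perm = torch.randperm(batch, device=obj_feats.device)
    # For each object slot, decide with probability p to take the partner's feature.
    mask = torch.rand(batch, num_obj, device=obj_feats.device) < p
    swapped[mask] = obj_feats[perm][mask]
    return swapped
```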
arXiv Detail & Related papers (2022-03-31T17:44:27Z)
- Application of DatasetGAN in medical imaging: preliminary studies [10.260087683496431]
Generative adversarial networks (GANs) have been widely investigated for many potential applications in medical imaging.
DatasetGAN is a recently proposed framework based on modern GANs that can synthesize high-quality segmented images.
There are no published studies focusing on its applications to medical imaging.
arXiv Detail & Related papers (2022-02-27T22:03:20Z)
- A Survey on RGB-D Datasets [69.73803123972297]
This paper reviewed and categorized image datasets that include depth information.
We gathered 203 datasets that contain accessible data and grouped them into three categories: scene/objects, body, and medical.
arXiv Detail & Related papers (2022-01-15T05:35:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.