Towards Answering Health-related Questions from Medical Videos: Datasets
and Approaches
- URL: http://arxiv.org/abs/2309.12224v1
- Date: Thu, 21 Sep 2023 16:21:28 GMT
- Authors: Deepak Gupta, Kush Attal, and Dina Demner-Fushman
- Abstract summary: A growing number of individuals now prefer instructional videos as they offer a series of step-by-step procedures to accomplish particular tasks.
The instructional videos from the medical domain may provide the best possible visual answers to first aid, medical emergency, and medical education questions.
The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications that can help the public with their health-related questions.
- Score: 21.16331827504689
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The increase in the availability of online videos has transformed the way we
access information and knowledge. A growing number of individuals now prefer
instructional videos as they offer a series of step-by-step procedures to
accomplish particular tasks. The instructional videos from the medical domain
may provide the best possible visual answers to first aid, medical emergency,
and medical education questions. Toward this goal, this paper focuses on
answering health-related questions asked by the public by providing visual
answers from medical videos. The scarcity of large-scale datasets in the
medical domain is a key challenge that hinders the development of applications
that can help the public with their health-related questions. To address this
issue, we first propose a pipelined approach to create two large-scale
datasets: HealthVidQA-CRF and HealthVidQA-Prompt. We then propose monomodal
and multimodal approaches that can effectively provide visual answers from
medical videos to natural language questions. We conducted a comprehensive
analysis of the results, focusing on the impact of the created datasets on
model training and the significance of visual features in enhancing the
performance of the monomodal and multimodal approaches. Our findings suggest
that these datasets have the potential to enhance the performance of medical
visual answer localization tasks and provide a promising future direction to
further enhance the performance by using pre-trained language-vision models.
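Illustration: the task described in the abstract, often called temporal visual answer localization, amounts to predicting the start and end frames of the video segment that answers a natural-language question. The sketch below is a minimal, generic example of such a multimodal localizer in PyTorch; it is not the authors' HealthVidQA model, and every module name, feature dimension, and the simple multiplicative fusion scheme are assumptions made purely for illustration.

```python
# Minimal, illustrative sketch of temporal visual answer localization:
# given a question and per-frame video features, score each frame as the
# start or end of the answering segment. All sizes and the fusion scheme
# are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn


class VisualAnswerLocalizer(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, hidden_dim=512):
        super().__init__()
        # Project the question and per-frame video features into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Lightweight temporal encoder over the fused frame sequence.
        self.temporal = nn.GRU(hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Per-frame scores for being the start / end of the visual answer.
        self.start_head = nn.Linear(2 * hidden_dim, 1)
        self.end_head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, question_emb, frame_feats):
        # question_emb: (batch, text_dim) pooled question representation
        # frame_feats:  (batch, num_frames, video_dim) per-frame features
        q = self.text_proj(question_emb).unsqueeze(1)      # (B, 1, H)
        v = self.video_proj(frame_feats)                   # (B, T, H)
        fused = v * q                                      # simple multiplicative fusion
        enc, _ = self.temporal(fused)                      # (B, T, 2H)
        start_logits = self.start_head(enc).squeeze(-1)    # (B, T)
        end_logits = self.end_head(enc).squeeze(-1)        # (B, T)
        return start_logits, end_logits


# Usage with random stand-in features: cross-entropy over the frame indices
# of the ground-truth answer span.
model = VisualAnswerLocalizer()
question = torch.randn(2, 768)
frames = torch.randn(2, 100, 1024)
start_logits, end_logits = model(question, frames)
loss = (nn.functional.cross_entropy(start_logits, torch.tensor([10, 3])) +
        nn.functional.cross_entropy(end_logits, torch.tensor([40, 25])))
```

At inference, the highest-scoring valid (start, end) pair would give the predicted visual answer segment.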
Related papers
- STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
- MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z)
- Medical Vision Language Pretraining: A survey [8.393439175704124]
Medical Vision Language Pretraining is a promising solution to the scarcity of labeled data in the medical domain.
By leveraging paired/unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations.
arXiv Detail & Related papers (2023-12-11T09:14:13Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text (a generic sketch of such a contrastive alignment objective appears after this list).
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery.
We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model.
We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
- Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks.
We propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z)
- Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? [50.29862466940209]
We introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions.
We analyze various pre-trained visual question answering models and gain insights into their characteristics.
We show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents.
arXiv Detail & Related papers (2023-02-23T00:33:54Z)
- A Dataset for Medical Instructional Video Classification and Question Answering [16.748852458926162]
This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos.
We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions.
We have benchmarked each task with the created MedVidCL and MedVidQA datasets and proposed multimodal learning methods.
arXiv Detail & Related papers (2022-01-30T18:06:31Z)
- Medical Visual Question Answering: A Survey [55.53205317089564]
Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and popular VQA challenges.
Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer.
arXiv Detail & Related papers (2021-11-19T05:55:15Z)
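Several of the related papers above rely on aligning image and text features through a contrastive objective. The sketch below is a generic, symmetric image-text contrastive (InfoNCE-style) loss; it is not taken from any specific paper in the list, and the embedding dimension and temperature value are illustrative assumptions.

```python
# Generic sketch of a symmetric image-text contrastive (InfoNCE) objective,
# of the kind used in many vision-language pretraining methods. Not the
# implementation of any specific paper listed above; sizes and temperature
# are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) embeddings from the two encoders.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched image-text pairs lie on the diagonal; pull them together and
    # push apart mismatched pairs, in both image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Usage with random stand-in embeddings:
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
loss = contrastive_alignment_loss(image_emb, text_emb)
```

In practice such a loss is typically combined with other objectives (e.g., masked modeling or multi-task supervision) rather than used alone.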