Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
- URL: http://arxiv.org/abs/2509.16628v1
- Date: Sat, 20 Sep 2025 11:07:36 GMT
- Title: Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
- Authors: Janak Kapuriya, Anwar Shaikh, Arnav Goel, Medha Hira, Apoorv Singh, Jay Saraf, Sanjana, Vaibhav Nauriyal, Avinash Anand, Zhengkui Wang, Rajiv Ratn Shah
- Abstract summary: We introduce Vision-Caption aware Supervised Fine-Tuning (VCASFT), a learning paradigm designed to enhance the performance of smaller Vision Language Models (VLMs). We benchmark it on ScienceQA, which consists of questions across diverse languages, subjects, and fields. To further demonstrate the effectiveness of this technique on low-resource languages, we developed HiSciVQA, a dataset comprising 2,245 high-quality, hand-annotated Hindi multimodal Q&A pairs.
- Score: 26.89241254462218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we introduce Vision-Caption aware Supervised Fine-Tuning (VCASFT), a novel learning paradigm designed to enhance the performance of smaller Vision Language Models (VLMs) on scientific visual question answering (VQA) tasks. VCASFT leverages image captions as zero-shot prompts alongside question-answer pairs and instruction-tunes models to yield significant performance improvements. To comprehensively evaluate VCASFT, we benchmark it on ScienceQA, which consists of questions across diverse languages, subjects, and fields, demonstrating its adaptability and effectiveness in a variety of educational contexts. To further demonstrate the effectiveness of this technique on low-resource languages, we developed HiSciVQA, a dataset comprising 2,245 high-quality, hand-annotated Hindi multimodal Q&A pairs. This dataset addresses the critical need for low-resource language Q&A datasets and serves as a foundation for testing VCASFT. Additionally, we introduce a novel LLM-based evaluation scheme to evaluate VLMs on HiSciVQA, which offers deeper insights into model effectiveness than traditional n-gram matching accuracy metrics. We are committed to advancing the field by open-sourcing all code files and the HiSciVQA dataset for the research community.
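As a concrete illustration of the VCASFT recipe described above, here is a minimal sketch of how an image caption could be injected as a zero-shot prompt alongside a question-answer pair to form one instruction-tuning example. The template, field names, and helper function are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of VCASFT-style example construction: the image caption is
# injected as zero-shot context alongside the question before instruction
# tuning. The template and field names are illustrative, not the paper's.

def build_vcasft_example(caption: str, question: str,
                         choices: list[str], answer: str) -> dict:
    """Pack one ScienceQA-style record into an instruction-tuning pair."""
    option_text = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = (
        "You are given an image described by the caption below.\n"
        f"Caption: {caption}\n"
        f"Question: {question}\n"
        f"Options:\n{option_text}\n"
        "Answer with the correct option."
    )
    return {"instruction": prompt, "output": answer}

example = build_vcasft_example(
    caption="A bar chart comparing boiling points of four liquids.",
    question="Which liquid has the highest boiling point?",
    choices=["water", "ethanol", "acetone", "mercury"],
    answer="(D) mercury",
)
print(example["instruction"])
```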
Related papers
- When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs [4.296395082987112]
Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance in various vision and language tasks. Small Vision-Language Models (S-VLMs) offer efficiency but suffer from a significant performance gap compared to their larger counterparts. We introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs.
arXiv Detail & Related papers (2025-09-20T11:12:23Z)
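The abstract above does not spell out the alignment mechanism, but one plausible reading of label-free parity alignment is to mine unlabeled samples where the small and large models disagree and use the large model's answers as pseudo-labels. The sketch below follows that reading; both predict callables and the record layout are assumptions.

```python
# Hedged sketch of a label-free parity-alignment loop in the spirit of MPA:
# the large VLM's answers serve as pseudo-labels on unlabeled data wherever
# the small VLM disagrees. Both predict() callables are assumptions here.

from typing import Callable

def collect_parity_gaps(
    samples: list[dict],
    small_vlm: Callable[[dict], str],
    large_vlm: Callable[[dict], str],
) -> list[dict]:
    """Return training pairs where the small model diverges from the large one."""
    gaps = []
    for sample in samples:  # each sample: {"image": ..., "question": ...}
        small_ans = small_vlm(sample)
        large_ans = large_vlm(sample)
        if small_ans.strip().lower() != large_ans.strip().lower():
            gaps.append({**sample, "target": large_ans})  # pseudo-label
    return gaps

# The resulting `gaps` set would then be used to fine-tune the small VLM.
```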
- Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation [33.51239538610773]
Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks. We propose Q-CLIP, the first fully VLM-based framework for Video Quality Assessment (VQA).
arXiv Detail & Related papers (2025-08-08T07:36:01Z)
- Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models [11.114790704621427]
Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantics. We propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches.
arXiv Detail & Related papers (2025-03-02T05:44:56Z)
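The curation step the abstract describes, keeping the top 30% of pairs ranked by a learned score model, can be sketched generically as below; the score_fn signature and the dummy scorer in the usage example are assumptions, not the paper's learned model.

```python
# Sketch of quality-driven curation: rank image-text pairs with a learned
# score model and keep the top 30% for fine-tuning, as the abstract describes.

from typing import Callable

def curate_top_fraction(
    pairs: list[dict],
    score_fn: Callable[[dict], float],
    keep_fraction: float = 0.30,
) -> list[dict]:
    """Keep the highest-scoring fraction of image-text pairs."""
    ranked = sorted(pairs, key=score_fn, reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]

# Usage with a stand-in scorer (caption length as a dummy quality proxy):
pairs = [{"image": f"img_{i}.tif", "caption": "word " * i} for i in range(1, 11)]
top = curate_top_fraction(pairs, score_fn=lambda p: len(p["caption"]))
print(len(top))  # 3 of 10 pairs survive the 30% cut
```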
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. We introduce the VQA$^2$ Instruction dataset, the first visual question answering instruction dataset that focuses on video quality assessment. The VQA$^2$ series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
- CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored.
This paper introduces RAVEN, a retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z)
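Retrieval augmentation for a VLM, as RAVEN's abstract outlines, roughly amounts to fetching the most relevant captioned examples and prepending them to the query. A minimal sketch under that assumption follows; the cosine retrieval over toy embeddings and the prompt format are illustrative, not RAVEN's actual pipeline.

```python
# Hedged sketch of retrieval augmentation for a VLM: retrieve the most
# similar captioned examples and prepend them as context before the query.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_rag_prompt(query: str, query_emb: list[float],
                     corpus: list[dict], k: int = 2) -> str:
    """Prepend the k nearest captions from the corpus to the query."""
    nearest = sorted(corpus, key=lambda d: cosine(query_emb, d["emb"]),
                     reverse=True)[:k]
    context = "\n".join(f"- {d['caption']}" for d in nearest)
    return f"Context from retrieved examples:\n{context}\n\nQuestion: {query}"

corpus = [
    {"caption": "A diagram of the water cycle.", "emb": [0.9, 0.1]},
    {"caption": "A food web in a forest ecosystem.", "emb": [0.2, 0.8]},
    {"caption": "Evaporation over an ocean.", "emb": [0.8, 0.3]},
]
print(build_rag_prompt("What drives evaporation?", [1.0, 0.0], corpus))
```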
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
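For orientation on prompt-based IQA with a CLIP backbone, here is the common antonym-prompt baseline popularized by CLIP-IQA: score quality as the softmax probability of a "good" versus "bad" text prompt. This is a generic recipe for context, not the paper's learned multi-modal prompts.

```python
# Generic antonym-prompt quality score with CLIP (CLIP-IQA-style baseline).
# Uses the Hugging Face transformers CLIP API; model choice is illustrative.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def quality_score(image: Image.Image) -> float:
    """Softmax over similarities to a 'good'/'bad' prompt pair -> [0, 1]."""
    inputs = processor(
        text=["a good photo", "a bad photo"],
        images=image, return_tensors="pt", padding=True,
    )
    logits = model(**inputs).logits_per_image  # shape (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()  # probability of "good"

# quality_score(Image.open("photo.jpg")) -> e.g. 0.73
```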
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
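A hedged sketch of the Chain-of-Spot interaction described above: first ask the LVLM where to look, then crop that region at native resolution and ask the question again with both views. The ask callable and the bounding-box reply format are assumptions.

```python
# Hedged sketch of the Chain-of-Spot idea: locate a region of interest, crop
# it at native resolution, and re-query with both the full image and the crop.

from typing import Callable
from PIL import Image

def chain_of_spot(image: Image.Image, question: str,
                  ask: Callable[..., str]) -> str:
    # Step 1: ask the LVLM for the region of interest as relative coords.
    reply = ask(image, f"Which region answers: '{question}'? "
                       "Reply as x0,y0,x1,y1 in [0,1].")
    x0, y0, x1, y1 = (float(v) for v in reply.split(","))
    w, h = image.size
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    # Step 2: answer using the full image plus the detailed crop.
    return ask([image, crop], question)
```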
- SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant [48.220285886328746]
We introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant.
SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing visual cues and prior language knowledge.
Fine-tuning SQ-LLaVA on higher-quality instruction data shows a performance improvement compared with traditional visual-instruction tuning methods.
arXiv Detail & Related papers (2024-03-17T18:42:38Z)
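The self-questioning loop SQ-LLaVA's abstract describes can be sketched as follows: the model generates its own image-grounded questions, answers them, and the resulting pairs feed a further round of visual-instruction tuning. The generate callable and prompt wording are stand-ins.

```python
# Hedged sketch of self-questioning: the model writes its own image-grounded
# questions, answers them, and the pairs become instruction-tuning data.

from typing import Callable

def self_question(image, generate: Callable[..., str], n: int = 3) -> list[dict]:
    """Produce n synthetic Q&A pairs grounded in the image."""
    pairs = []
    for i in range(n):
        q = generate(image, f"Ask a meaningful question #{i + 1} about this image.")
        a = generate(image, q)
        pairs.append({"question": q, "answer": a})
    return pairs  # candidate data for a further visual-instruction tuning round
```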
- Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering [6.798129852396113]
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. To mitigate the challenges associated with evaluating free-form open-ended VQA responses, we introduce a straightforward LLM-guided pre-processing technique.
arXiv Detail & Related papers (2023-06-16T17:47:57Z)
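The LLM-guided pre-processing idea, which also echoes the LLM-based HiSciVQA evaluation scheme in the main abstract, can be sketched as normalizing a free-form response to a canonical option before exact-match scoring. The llm callable and instruction wording are assumptions.

```python
# Sketch of LLM-guided pre-processing for free-form VQA answers: an LLM maps
# a verbose response onto a canonical option before exact-match scoring.

from typing import Callable

def normalize_answer(response: str, choices: list[str],
                     llm: Callable[[str], str]) -> str:
    prompt = (
        "Map the response to exactly one of the options, answering with the "
        f"option text only.\nOptions: {', '.join(choices)}\n"
        f"Response: {response}"
    )
    return llm(prompt).strip().lower()

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

# normalize_answer("I believe the liquid shown is water.", ["water", "oil"], llm)
# -> "water", which then exact-matches the gold label.
```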