Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs
- URL: http://arxiv.org/abs/2503.14674v1
- Date: Tue, 18 Mar 2025 19:29:07 GMT
- Title: Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs
- Authors: Liu Jing, Amirul Rahman
- Abstract summary: We propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs, and training the LVLM with a multi-task loss that encourages the generation and answering of these intermediate steps, as well as the prediction of the final answer. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies further validate the contribution of each component of our approach, and human evaluation confirms the improved accuracy and coherence of the reasoning process enabled by our method.
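The abstract describes a multi-task objective that jointly supervises sub-question generation, the answering of those sub-questions, and prediction of the final answer. The paper's implementation is not reproduced here; the sketch below is a minimal, hypothetical illustration of such a loss, assuming the reasoning chain is tokenized into a single target sequence with boolean masks marking the sub-question, sub-answer, and final-answer segments, and that the segment weights are free hyperparameters rather than values from the paper.

```python
# Minimal sketch (not the authors' code) of a multi-task loss that supervises
# sub-question generation, sub-answer generation, and final-answer prediction.
# Segment masks and lambda weights are assumed for illustration.
import torch
import torch.nn.functional as F

def multi_task_loss(logits, targets, segment_masks, lambdas=(1.0, 1.0, 1.0)):
    """
    logits:        (batch, seq_len, vocab) token logits from the LVLM decoder.
    targets:       (batch, seq_len) gold token ids for the full reasoning chain
                   (sub-questions, sub-answers, and final answer concatenated).
    segment_masks: dict of boolean masks of shape (batch, seq_len) selecting
                   the tokens of each segment: 'sub_q', 'sub_a', 'final'.
    lambdas:       weights for the three loss terms (hypothetical).
    """
    # Per-token cross-entropy over the whole sequence.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # (batch, seq_len)

    def masked_mean(mask):
        mask = mask.float()
        return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

    l_sub_q = masked_mean(segment_masks["sub_q"])   # generate intermediate sub-questions
    l_sub_a = masked_mean(segment_masks["sub_a"])   # answer each sub-question
    l_final = masked_mean(segment_masks["final"])   # predict the final answer

    w_q, w_a, w_f = lambdas
    return w_q * l_sub_q + w_a * l_sub_a + w_f * l_final
```

In this reading, the model is trained end-to-end on the augmented reasoning chains, so the self-questioning behaviour is learned implicitly rather than prompted at inference time; the exact weighting and segmentation scheme used by the authors may differ.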
Related papers
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs).
We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization.
OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z) - Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs [86.21199607040147]
Multimodal Large Language Models (MLLMs) face challenges with fine-grained perception and complex reasoning. Prevalent multimodal pre-training approaches focus on enhancing perception by training on high-quality image captions. We introduce Self-Improving cognition (SIcog), a self-learning framework designed to construct next-generation foundation MLLMs.
arXiv Detail & Related papers (2025-03-16T00:25:13Z) - Memory-enhanced Retrieval Augmentation for Long Video Understanding [57.371543819761555]
We introduce a novel RAG-based LVU approach inspired by the cognitive memory of human beings, called MemVid. Our approach operates in four basic steps: memorizing holistic video information, reasoning about the task's information needs based on the memory, retrieving critical moments based on the information needs, and focusing on the retrieved moments to produce the final answer.
arXiv Detail & Related papers (2025-03-12T08:23:32Z) - Improving Generalization in Visual Reasoning via Self-Ensemble [0.0]
We propose self-ensemble, a novel method that improves the generalization and visual reasoning of the model without updating any parameters.
Our key insight is that an LVLM can ensemble with itself, without the need for any other LVLMs, which helps to unlock its internal capabilities.
arXiv Detail & Related papers (2024-10-28T10:04:40Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues.
Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images.
This paper systematically explores the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z) - Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning [77.72128397088409]
We show that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question.
We also propose a novel reinforcement learning paradigm to encourage the model to actively perform composite deduction.
arXiv Detail & Related papers (2024-04-19T15:53:27Z) - An Enhanced Prompt-Based LLM Reasoning Scheme via Knowledge Graph-Integrated Collaboration [7.3636034708923255]
This study proposes a collaborative, training-free reasoning scheme involving tight cooperation between Knowledge Graphs (KGs) and Large Language Models (LLMs).
Through such a cooperative approach, our scheme achieves more reliable knowledge-based reasoning and facilitates the tracing of the reasoning results.
arXiv Detail & Related papers (2024-02-07T15:56:17Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - KECP: Knowledge Enhanced Contrastive Prompting for Few-shot Extractive Question Answering [28.18555591429343]
We propose a novel framework named Knowledge Enhanced Contrastive Prompt-tuning (KECP).
Instead of adding pointer heads to PLMs, we transform the task into a non-autoregressive Masked Language Modeling (MLM) generation problem.
Our method consistently outperforms state-of-the-art approaches in few-shot settings by a large margin.
arXiv Detail & Related papers (2022-05-06T08:31:02Z) - Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks [59.761411682238645]
Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks.
We introduce a method to incorporate evidentiality of passages -- whether a passage contains correct evidence to support the output -- into training the generator.
arXiv Detail & Related papers (2021-12-16T08:18:47Z)