Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection
- URL: http://arxiv.org/abs/2512.13250v1
- Date: Mon, 15 Dec 2025 12:04:26 GMT
- Title: Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection
- Authors: Juil Koo, Daehyeon Choi, Sangwoo Youn, Phillip Y. Lee, Minhyuk Sung
- Abstract summary: We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes.
- Score: 26.020338338880034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
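The two-stage recipe in the abstract (supervised fine-tuning, then RL-based policy optimization) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not state the action space, the reward, or the RL algorithm, so this example uses a small set of discrete candidate views, a hypothetical per-view reward standing in for downstream answer correctness, and an exact policy-gradient update on a plain softmax policy instead of a VLM.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_step(logits, target_view, lr=0.5):
    """Stage 1 (supervised): one cross-entropy gradient step toward the
    labelled target view. The gradient of -log p[target] with respect to
    the logits is probs - onehot(target)."""
    probs = softmax(logits)
    return [
        logit - lr * (p - (1.0 if i == target_view else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def policy_gradient_step(logits, rewards, lr=0.5):
    """Stage 2 (RL): one exact policy-gradient ascent step on expected reward.
    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    The rewards list is a hypothetical stand-in for answer correctness."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - baseline)
        for logit, p, r in zip(logits, probs, rewards)
    ]


def train(num_views, target_view, rewards, sft_steps=5, rl_steps=20):
    """Run the supervised stage first, then the policy-gradient stage,
    and return the final probability assigned to each candidate view."""
    logits = [0.0] * num_views
    for _ in range(sft_steps):
        logits = sft_step(logits, target_view)
    for _ in range(rl_steps):
        logits = policy_gradient_step(logits, rewards)
    return softmax(logits)


if __name__ == "__main__":
    # Four hypothetical candidate views; only view 2 lets the question be answered.
    view_probs = train(num_views=4, target_view=2, rewards=[0.0, 0.0, 1.0, 0.0])
    print(max(range(4), key=view_probs.__getitem__))
```

The supervised stage gives the policy a reasonable starting point from paired query-target views, and the reward-driven stage then sharpens it toward views that actually help answer the question. A real system would replace the logits with a fine-tuned VLM and the reward list with a measured question-answering outcome.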
Related papers
- Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts [2.3160863001888914]
We introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts. AerialSense is a benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks.
arXiv Detail & Related papers (2025-12-08T08:44:57Z)
- Hidden in plain sight: VLMs overlook their visual representations [48.83628674170634]
We compare vision language models (VLMs) to their visual encoders to understand their ability to integrate across these modalities. We find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance.
arXiv Detail & Related papers (2025-06-09T17:59:54Z)
- VoQA: Visual-only Question Answering [7.251596370310251]
We propose Visual-only Question Answering (VoQA), a novel multimodal task in which questions are visually embedded within images. This requires models to locate, recognize, and reason over visually embedded textual questions. We introduce Guided Response Triggering Supervised Fine-tuning (GRT-SFT), a structured fine-tuning strategy that guides the model to perform step-by-step reasoning purely based on visual input.
arXiv Detail & Related papers (2025-05-20T11:37:49Z)
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs).
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant [48.220285886328746]
We introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant.
SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing the visual clue and prior language knowledge.
Fine-tuning SQ-LLaVA on higher-quality instruction data shows a performance improvement compared with traditional visual-instruction tuning methods.
arXiv Detail & Related papers (2024-03-17T18:42:38Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
- Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering [6.798129852396113]
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. To mitigate the challenges associated with evaluating free-form open-ended VQA responses, we introduce a straightforward LLM-guided pre-processing technique.
arXiv Detail & Related papers (2023-06-16T17:47:57Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.