FlySearch: Exploring how vision-language models explore
- URL: http://arxiv.org/abs/2506.02896v3
- Date: Mon, 20 Oct 2025 22:34:14 GMT
- Title: FlySearch: Exploring how vision-language models explore
- Authors: Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz, Marek Cygan, Bartosz Zieliński, Maciej Wołczyk
- Abstract summary: We introduce FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We observe that state-of-the-art Vision-Language Models (VLMs) cannot reliably solve even the simplest exploration tasks. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning.
- Score: 5.7210882663967615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.
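The abstract implies a closed-loop evaluation protocol: the VLM repeatedly receives a rendered view of the scene, issues a movement command, and eventually declares the target found. Below is a minimal sketch of such a loop; the environment interface, prompt format, and helper names are illustrative assumptions, not the released FlySearch API.

```python
# Minimal sketch of a VLM exploration loop of the kind FlySearch evaluates.
# `env` and `query_vlm` are hypothetical stand-ins, not the benchmark's API.

def parse_action(reply: str) -> tuple[float, float, float]:
    """Parse 'MOVE dx dy dz'; stay in place on malformed output, one of the
    failure modes (broken action formatting) the paper points to."""
    parts = reply.split()
    try:
        return (float(parts[1]), float(parts[2]), float(parts[3]))
    except (IndexError, ValueError):
        return (0.0, 0.0, 0.0)

def run_episode(env, query_vlm, target: str, max_steps: int = 30) -> bool:
    obs = env.reset(target)  # obs: rendered frame plus agent pose (assumed)
    for _ in range(max_steps):
        prompt = (
            f"You are flying over an outdoor scene, searching for: {target}. "
            f"Current pose: {obs.pose}. Reply 'MOVE dx dy dz' or 'FOUND'."
        )
        reply = query_vlm(prompt, obs.frame)
        if reply.strip().upper().startswith("FOUND"):
            return env.target_in_view()  # count success only if truly found
        obs = env.step(parse_action(reply))
    return False
```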
Related papers
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models [79.77807330964576]
Vision-DeepResearch systems use search engines for complex visual-textual fact-finding. Existing benchmarks are not visual search-centric. We construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances.
arXiv Detail & Related papers (2026-02-02T14:53:11Z) - InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search [48.79494320593913]
We introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. We propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher).
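Read literally, the summary suggests a loop in which vReasoner decomposes the question and delegates fine-grained lookups to vSearcher. A hedged sketch of that division of labor follows; the message format and function names are assumptions for illustration, as the abstract does not give the authors' interface.

```python
# Hypothetical reasoner/searcher loop matching the summary's description.
# `reason` and `search` stand in for the vReasoner and vSearcher agents.
from typing import Any, Callable

def solve(question: str, image: Any,
          reason: Callable[..., dict], search: Callable[..., str],
          max_rounds: int = 5) -> str:
    notes: list[str] = []
    for _ in range(max_rounds):
        step = reason(question, image, notes)        # vReasoner decides
        if step["type"] == "answer":                 # confident final answer
            return step["text"]
        notes.append(search(image, step["region"]))  # vSearcher zooms in
    # Out of budget: force a best-effort answer from the accumulated notes.
    return reason(question, image, notes, force_answer=True)["text"]
```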
arXiv Detail & Related papers (2025-12-21T14:23:07Z) - SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks [53.611256895338585]
We introduce SIRI-Bench, a benchmark designed to evaluate Vision-Language Models' spatial intelligence through video-based reasoning tasks. SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine.
arXiv Detail & Related papers (2025-06-17T13:40:00Z) - SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models [10.671262416557704]
Vision Foundation Models (VFMs) offer powerful capabilities for visual understanding and reasoning. We present a zero-shot object goal navigation framework that integrates the perceptual strength of VFMs with a model-based planner. We evaluate our approach on the HM3D dataset using the Habitat simulator and demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-06-04T03:04:54Z) - Vision language models are unreliable at trivial spatial cognition [0.2902243522110345]
Vision language models (VLMs) are designed to extract relevant visuospatial information from images. We develop a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and use it to evaluate state-of-the-art VLMs. Results show that performance can be degraded by minor variations of prompts that use equivalent descriptions.
arXiv Detail & Related papers (2025-04-22T17:38:01Z) - How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game [11.721839449847472]
We introduce MM-Escape, a benchmark for investigating multimodal reasoning. MM-Escape emphasizes intermediate model behaviors alongside final task completion. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks. We observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities.
arXiv Detail & Related papers (2025-03-13T04:48:43Z) - BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games [44.16513620589459]
We introduce BALROG, a novel benchmark to assess the agentic capabilities of Large Language Models (LLMs) and Vision Language Models (VLMs). Our benchmark incorporates existing reinforcement learning environments of varying difficulty, from tasks solvable by non-expert humans in seconds to extremely challenging ones that may take years to master. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks.
arXiv Detail & Related papers (2024-11-20T18:54:32Z) - ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. We propose visual-temporal context, a novel communication protocol between VLMs and policy models.
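The summary names the contribution, a communication protocol, without specifying it. A plausible schematic of what such a visual-temporal context message could carry is sketched below; the field names and policy signature are assumptions for illustration, not ROCKET-1's actual schema.

```python
# Illustrative message passed from a high-level VLM to a low-level policy:
# a region grounded in a specific (possibly past) observation plus an intent.
from dataclasses import dataclass
import numpy as np

@dataclass
class VisualTemporalContext:
    frame_index: int         # which observation in the recent history
    object_mask: np.ndarray  # binary mask marking the entity in that frame
    interaction: str         # e.g. "approach" or "use" (invented labels)

def policy_step(policy, history: list, ctx: VisualTemporalContext):
    """Low-level policy conditions on raw observations plus grounded context."""
    return policy(history[ctx.frame_index], history[-1], ctx)
```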
arXiv Detail & Related papers (2024-10-23T13:26:59Z) - Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video [18.14234312389889]
Vision-Language Models (VLMs) have shown success as foundational models for downstream vision and natural language applications. We present a spatial extension to the VLM that leverages spatially localized egocentric video demonstrations. We show that our approach outperforms a baseline that uses a VLM to map the similarity of a task's description over a set of location-tagged images.
arXiv Detail & Related papers (2024-07-18T18:55:56Z) - Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z) - An Embodied Generalist Agent in 3D World [67.16935110789528]
We introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world.
We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world.
Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation.
arXiv Detail & Related papers (2023-11-18T01:21:38Z) - CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z) - Batch Exploration with Examples for Scalable Robotic Reinforcement Learning [63.552788688544254]
Batch Exploration with Examples (BEE) explores relevant regions of the state space, guided by a modest number of human-provided images of important states.
BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot.
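The core idea as the summary states it, steering exploration toward states that resemble a few human-provided images, can be sketched as a relevance score over observations; the embeddings and scoring below are stand-ins, not BEE's learned model.

```python
# Score an observation by similarity to human-provided images of important
# states; an exploration policy can then prioritize high-scoring regions.
import numpy as np

def relevance(obs_embedding: np.ndarray,
              goal_embeddings: list[np.ndarray]) -> float:
    """Max cosine similarity to any human-marked important state
    (a stand-in for BEE's learned relevance model)."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(cos(obs_embedding, g) for g in goal_embeddings)
```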
arXiv Detail & Related papers (2020-10-22T17:49:25Z) - Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both success rate (SR) and success weighted by path length (SPL).
arXiv Detail & Related papers (2020-04-29T08:46:38Z) - An Exploration of Embodied Visual Exploration [97.21890864063872]
Embodied computer vision considers perception for robots in novel, unstructured environments.
We present a taxonomy for existing visual exploration algorithms and create a standard framework for benchmarking them.
We then perform a thorough empirical study of the four state-of-the-art paradigms using the proposed framework.
arXiv Detail & Related papers (2020-01-07T17:40:32Z)