Related papers: Thinking in 360°: Humanoid Visual Search in the Wild

URL: http://arxiv.org/abs/2511.20351v2
Date: Wed, 26 Nov 2025 05:53:19 GMT
Title: Thinking in 360°: Humanoid Visual Search in the Wild
Authors: Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li
Abstract summary: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. We propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search.
Score: 52.29500214210115
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.

Related papers

FlySearch: Exploring how vision-language models explore [5.7210882663967615]
We introduce FlySearch, a 3D outdoor environment for searching and navigating to objects in complex scenes. We observe that state-of-the-art Vision-Language Models (VLMs) cannot reliably solve even the simplest exploration tasks. We identify a set of central causes, ranging from vision, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning.
arXiv Detail & Related papers (2025-06-03T14:03:42Z)
HINT: Learning Complete Human Neural Representations from Limited Viewpoints [69.76947323932107]
We propose a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles. As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR.
arXiv Detail & Related papers (2024-05-30T05:43:09Z)
R2Human: Real-Time 3D Human Appearance Rendering from a Single Image [42.74145788079571]
R2Human is the first approach for real-time inference and rendering of 3D human appearance from a single image. We present an end-to-end network that performs high-fidelity color reconstruction of visible areas and provides reliable color inference for occluded regions.
arXiv Detail & Related papers (2023-12-10T08:59:43Z)
MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures [44.172804112944625]
We present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
arXiv Detail & Related papers (2023-12-05T18:50:12Z)
DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering [126.00165445599764]
We present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering. Our dataset contains over 1500 human subjects, 5000 motion sequences, and 67.5M frames' data volume. We construct a professional multi-view system to capture data, which contains 60 synchronous cameras with max 4096 x 3000 resolution, 15 fps speed, and stern camera calibration steps.
arXiv Detail & Related papers (2023-07-19T17:58:03Z)
RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars [157.82758221794452]
We present RenderMe-360, a comprehensive 4D human head dataset to drive advance in head avatar research. It contains massive data assets, with 243+ million complete head frames, and over 800k video sequences from 500 different identities. Based on the dataset, we build a comprehensive benchmark for head avatar research, with 16 state-of-the-art methods performed on five main tasks.
arXiv Detail & Related papers (2023-05-22T17:54:01Z)
Gait Recognition in the Wild with Dense 3D Representations and A Benchmark [86.68648536257588]
Existing studies for gait recognition are dominated by 2D representations like the silhouette or skeleton of the human body in constrained scenes. This paper aims to explore dense 3D representations for gait recognition in the wild. We build the first large-scale 3D representation-based gait recognition dataset, named Gait3D.
arXiv Detail & Related papers (2022-04-06T03:54:06Z)
Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild [96.08358373137438]
We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene. Our method runs on datasets without any scene- or object-level 3D supervision.
arXiv Detail & Related papers (2020-07-30T17:59:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
