Learning Situated Awareness in the Real World
- URL: http://arxiv.org/abs/2602.16682v1
- Date: Wed, 18 Feb 2026 18:22:52 GMT
- Title: Learning Situated Awareness in the Real World
- Authors: Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang
- Abstract summary: SAW-Bench is a novel benchmark for evaluating egocentric situated awareness using real-world videos. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash.
- Score: 63.75211123289058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
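As context for the headline number: the reported 37.66% gap is human accuracy minus model accuracy on the same question set. Below is a minimal sketch of such a multiple-choice evaluation loop, assuming a hypothetical JSONL file with `video`, `question`, `options`, and `answer_idx` fields; the released SAW-Bench format may differ.

```python
import json

def accuracy(pred_fn, qa_path: str) -> float:
    """Score a model on multiple-choice QA items stored one per line.

    pred_fn(video, question, options) should return the index of the
    chosen option. The field names below are illustrative assumptions,
    not the actual SAW-Bench schema.
    """
    correct = total = 0
    with open(qa_path) as f:
        for line in f:
            item = json.loads(line)
            pred = pred_fn(item["video"], item["question"], item["options"])
            correct += int(pred == item["answer_idx"])
            total += 1
    return correct / total

# The human-model gap is then a plain difference of two such scores:
# gap = human_acc - model_acc   # 37.66 points for the best MFM
```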
Related papers
- EgoSound: Benchmarking Sound Understanding in Egocentric Videos [68.1897133235638]
We introduce EgoSound, the first benchmark designed to evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning.
arXiv Detail & Related papers (2026-02-15T12:46:35Z) - Egocentric Bias in Vision-Language Models [11.385014698426088]
We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
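The geometric core of such an item can be made concrete with a short sketch: rotating a rectangular grid of characters by 180 degrees, i.e. what an agent facing the observer would read. The character set and item format here are assumptions for illustration, not the benchmark's actual specification.

```python
# Characters that map onto another character (or themselves) under a
# 180-degree rotation; anything absent is treated as unreadable.
ROT180 = {"0": "0", "1": "1", "6": "9", "8": "8", "9": "6",
          "H": "H", "I": "I", "N": "N", "O": "O", "S": "S",
          "X": "X", "Z": "Z"}

def rotate_180(grid: list[str]) -> list[str]:
    """Rotate a rectangular character grid by 180 degrees: reverse the
    row order, reverse each row, and substitute every character with
    its rotated counterpart."""
    return ["".join(ROT180.get(ch, "?") for ch in reversed(row))
            for row in reversed(grid)]

print(rotate_180(["196", "801"]))  # ['108', '961']
```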
arXiv Detail & Related papers (2026-02-10T03:51:00Z) - REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories [19.741468026765062]
We introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans.
arXiv Detail & Related papers (2025-11-30T05:20:22Z) - ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction [35.24704057622881]
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction.
arXiv Detail & Related papers (2025-11-26T00:06:02Z) - Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z) - Spatial Mental Modeling from Limited Views [71.57140964322559]
Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap. Using MindCube, we evaluate how well Vision Language Models (VLMs) build robust spatial mental models. We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps.
arXiv Detail & Related papers (2025-06-26T16:38:19Z) - EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
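For readers unfamiliar with the objective, the following is a minimal sketch of masked modeling over a token sequence, the generic training signal the summary describes. The vocabulary size, masking ratio, and tiny architecture are illustrative assumptions, not EgoM2P's actual design.

```python
import torch
import torch.nn as nn

VOCAB, DIM, MASK_ID, MASK_RATIO = 1024, 256, 0, 0.4

class MaskedTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

model = MaskedTokenModel()
tokens = torch.randint(1, VOCAB, (8, 64))      # a batch of token sequences
mask = torch.rand(tokens.shape) < MASK_RATIO   # positions to hide
inputs = tokens.masked_fill(mask, MASK_ID)     # replace them with a mask token
logits = model(inputs)
# The loss is cross-entropy on the masked positions only.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```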
arXiv Detail & Related papers (2025-06-09T15:59:25Z) - EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World [12.699670048897085]
In human imitation learning, the imitator typically takes the egocentric view as a benchmark, naturally transferring behaviors observed from an exocentric view to their own. We introduce EgoMe, which follows the process of human imitation learning via the imitator's egocentric view in the real world. Our dataset includes 7,902 paired exo-ego videos spanning diverse daily behaviors in various real-world scenarios.
arXiv Detail & Related papers (2025-01-31T11:48:22Z) - EgoEnv: Human-centric environment representations from egocentric video [60.34649902578047]
First-person video highlights a camera-wearer's activities in the context of their persistent environment.
Current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space.
We present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings.
arXiv Detail & Related papers (2022-07-22T22:39:57Z)