ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
- URL: http://arxiv.org/abs/2511.20937v1
- Date: Wed, 26 Nov 2025 00:06:02 GMT
- Title: ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
- Authors: Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li
- Abstract summary: Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction.
- Score: 35.24704057622881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition (affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input) while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
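The two reordering tasks described in the abstract can be illustrated with a toy, fully symbolic sketch. This is not the ENACT pipeline: the benchmark uses egocentric images under partial observability, whereas the sketch below assumes each observation is a complete scene graph, stored as a set of (subject, relation, object) triples, and each action is a pair of added and removed triples. The function names, the triples, and the brute-force search are all hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Assumption: a scene graph is a frozenset of (subject, relation, object) triples,
# and an action is a scene-graph change given as (added_triples, removed_triples).

def apply_action(state, action):
    """Apply one scene-graph change and return the new state."""
    added, removed = action
    return frozenset((state - removed) | added)

def rollout(initial, actions):
    """Return the list of states reached after each action, in order."""
    states, state = [], initial
    for action in actions:
        state = apply_action(state, action)
        states.append(state)
    return states

def forward_reorder(initial, actions, shuffled_obs):
    """Forward task: given ordered actions, return indices that sort shuffled_obs."""
    return [shuffled_obs.index(s) for s in rollout(initial, actions)]

def inverse_reorder(initial, ordered_obs, shuffled_actions):
    """Inverse task: given ordered observations, return indices that sort shuffled_actions.

    Brute force over permutations, so only suitable for very short sequences.
    """
    for perm in permutations(range(len(shuffled_actions))):
        if rollout(initial, [shuffled_actions[i] for i in perm]) == ordered_obs:
            return list(perm)
    return None

# Hypothetical three-step activity: open the fridge, pick up an apple, store it.
initial = frozenset({("apple", "on", "table"), ("fridge", "is", "closed")})
open_fridge = (frozenset({("fridge", "is", "open")}), frozenset({("fridge", "is", "closed")}))
pick_apple = (frozenset({("apple", "in", "hand")}), frozenset({("apple", "on", "table")}))
store_apple = (frozenset({("apple", "inside", "fridge")}), frozenset({("apple", "in", "hand")}))

actions = [open_fridge, pick_apple, store_apple]
s0, s1, s2 = rollout(initial, actions)

forward_answer = forward_reorder(initial, actions, [s2, s0, s1])
inverse_answer = inverse_reorder(initial, [s0, s1, s2], [store_apple, open_fridge, pick_apple])
print(forward_answer, inverse_answer)
```

In this toy setting both tasks reduce to simulating the action sequence; the difficulty the benchmark targets comes from having to infer such state changes from images rather than from symbolic graphs.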
Related papers
- Learning Situated Awareness in the Real World [63.75211123289058]
SAW-Bench is a novel benchmark for evaluating egocentric situated awareness using real-world videos. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash.
arXiv Detail & Related papers (2026-02-18T18:22:52Z) - EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models [31.768426199719816]
We propose EgoActing, which requires directly grounding high-level instructions into various, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations.
arXiv Detail & Related papers (2026-02-04T13:04:56Z) - ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying [15.728211622542267]
ViThinker is a framework that enables vision-language models to autonomously generate decision tokens triggering the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls.
arXiv Detail & Related papers (2026-02-02T22:29:57Z) - ECHO: Ego-Centric modeling of Human-Object interactions [71.17118015822699]
ECHO (Ego-Centric modeling of Human-Object interactions) is developed. It recovers three modalities: human pose, object motion, and contact from such minimal observation. It outperforms existing methods that do not offer the same flexibility.
arXiv Detail & Related papers (2025-08-29T12:12:22Z) - Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection [3.656114607436271]
Human-object interaction detection (HOID) refers to localizing interactive human-object pairs in images and identifying the interactions. We build a top-down framework named Funnel-HOI inspired by the human tendency to grasp well-defined concepts first and then associate them with abstract concepts during scene understanding. A novel asymmetric co-attention mechanism mines these cues utilizing multimodal information (incorporating zero-shot capabilities) and yields stronger interaction representations at the encoder level.
arXiv Detail & Related papers (2025-07-16T20:47:24Z) - EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views [51.53089073920215]
Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception.
Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view.
We present EgoChoir, which links object structures with interaction contexts inherent in appearance and head motion to reveal object affordance.
arXiv Detail & Related papers (2024-05-22T14:03:48Z) - Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z) - UniAR: A Unified model for predicting human Attention and Responses on visual content [12.281060227170792]
We propose UniAR -- a unified model of human attention and preference behavior across diverse visual content.
We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve SOTA performance on multiple benchmarks.
Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creation for human-centric improvements.
arXiv Detail & Related papers (2023-12-15T19:57:07Z) - Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Computing a human-like reaction time metric from stable recurrent vision models [11.87006916768365]
We sketch a general-purpose methodology to construct computational accounts of reaction times from a stimulus-computable, task-optimized model.
We demonstrate that our metric aligns with patterns of human reaction times for stimulus manipulations across four disparate visual decision-making tasks.
This work paves the way for exploring the temporal alignment of model and human visual strategies in the context of various other cognitive tasks.
arXiv Detail & Related papers (2023-06-20T14:56:02Z) - Modelling Human Visual Motion Processing with Trainable Motion Energy Sensing and a Self-attention Network [1.9458156037869137]
We propose an image-computable model of human motion perception by bridging the gap between biological and computer vision models.
This model architecture aims to capture the computations in V1-MT, the core structure for motion perception in the biological visual system.
In silico neurophysiology reveals that our model's unit responses are similar to mammalian neural recordings regarding motion pooling and speed tuning.
arXiv Detail & Related papers (2023-05-16T04:16:07Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art methods specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.