Human-level 3D shape perception emerges from multi-view learning
- URL: http://arxiv.org/abs/2602.17650v1
- Date: Thu, 19 Feb 2026 18:56:05 GMT
- Title: Human-level 3D shape perception emerges from multi-view learning
- Authors: Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa,
- Abstract summary: We develop a modeling framework that predicts human 3D shape inferences for arbitrary objects.<n>We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data.<n>We find that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
- Score: 63.048728487674815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these `multi-view' models on a well established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
Related papers
- A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments [36.264146096690375]
Mental rotation is a fundamental example of mental simulation and spatial world modelling in humans.<n>We propose a mechanistic model of human mental rotation, leveraging advances in deep, equivariant, and neuro-symbolic learning.<n>Our model captures well the performance, response times and behavior of participants in our and others' experiments.
arXiv Detail & Related papers (2025-12-15T16:43:50Z) - Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets.<n>We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images.
arXiv Detail & Related papers (2025-03-25T23:55:47Z) - Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z) - Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z) - Approaching human 3D shape perception with neurally mappable models [15.090436065092716]
Humans effortlessly infer the 3D shape of objects.
None of current computational models capture the human ability to match object shape across viewpoints.
This work provides a foundation for understanding human shape inferences within neurally mappable computational architectures.
arXiv Detail & Related papers (2023-08-22T09:29:05Z) - Evaluating alignment between humans and neural network representations in image-based learning tasks [5.657101730705275]
We tested how well the representations of $86$ pretrained neural network models mapped to human learning trajectories.<n>We found that while training dataset size was a core determinant of alignment with human choices, contrastive training with multi-modal data (text and imagery) was a common feature of currently publicly available models that predicted human generalisation.<n>In conclusion, pretrained neural networks can serve to extract representations for cognitive models, as they appear to capture some fundamental aspects of cognition that are transferable across tasks.
arXiv Detail & Related papers (2023-06-15T08:18:29Z) - CHORE: Contact, Human and Object REconstruction from a single RGB image [40.817960406002506]
CHORE is a novel method that learns to jointly reconstruct the human and the object from a single RGB image.
We compute a neural reconstruction of human and object represented implicitly with two unsigned distance fields.
Experiments show that our joint reconstruction learned with the proposed strategy significantly outperforms the SOTA.
arXiv Detail & Related papers (2022-04-05T18:38:06Z) - LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human
Bodies [78.17425779503047]
We propose a novel neural implicit representation for the human body.
It is fully differentiable and optimizable with disentangled shape and pose latent spaces.
Our model can be trained and fine-tuned directly on non-watertight raw data with well-designed losses.
arXiv Detail & Related papers (2021-11-30T04:10:57Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.