Towards aligned body representations in vision models
- URL: http://arxiv.org/abs/2512.00365v1
- Date: Sat, 29 Nov 2025 07:25:32 GMT
- Title: Towards aligned body representations in vision models
- Authors: Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman,
- Abstract summary: We test whether vision models trained for segmentation develop comparable representations.<n>We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grain encodings.
- Score: 7.548979981481746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human physical reasoning relies on internal "body" representations - coarse, volumetric approximations that capture an object's extent and support intuitive predictions about motion and physics. While psychophysical evidence suggests humans use such coarse representations, their internal structure remains largely unknown. Here we test whether vision models trained for segmentation develop comparable representations. We adapt a psychophysical experiment conducted with 50 human participants to a semantic segmentation task and test a family of seven segmentation networks, varying in size. We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grain encodings. Our results demonstrate that coarse representations can emerge under limited computational resources, and that machine representations can provide a scalable path toward understanding the structure of physical reasoning in the brain.
Related papers
- Human-level 3D shape perception emerges from multi-view learning [63.048728487674815]
We develop a modeling framework that predicts human 3D shape inferences for arbitrary objects.<n>We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data.<n>We find that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
arXiv Detail & Related papers (2026-02-19T18:56:05Z) - Human-Like Coarse Object Representations in Vision Models [7.548979981481746]
Humans represent objects for intuitive physics with coarse, bodies'' that are largely unknown.<n>We optimize pixel-accurate masks that may misalign with such bodies.<n>We find that alignment with human behavior follows an inverse U-shaped curve.
arXiv Detail & Related papers (2026-02-12T23:59:58Z) - Revealing emergent human-like conceptual representations from language prediction [90.73285317321312]
Large language models (LLMs) trained solely through next-token prediction on text exhibit strikingly human-like behaviors.<n>Are these models developing concepts akin to those of humans?<n>We found that LLMs can flexibly derive concepts from linguistic descriptions in relation to contextual cues about other concepts.
arXiv Detail & Related papers (2025-01-21T23:54:17Z) - When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Dual Thinking and Logical Processing -- Are Multi-modal Large Language Models Closing the Gap with Human Vision ? [5.076961098583674]
We introduce a novel adversarial dataset to provide evidence for the dual thinking framework in human vision.<n>Our psychophysical studies show the presence of multiple inferences in rapid succession.<n>Analysis of errors shows that the early stopping of visual processing can result in missing relevant information.
arXiv Detail & Related papers (2024-06-11T05:50:34Z) - Human-Like Geometric Abstraction in Large Pre-trained Neural Networks [6.650735854030166]
We revisit empirical results in cognitive science on geometric visual processing.
We identify three key biases in geometric visual processing.
We test tasks from the literature that probe these biases in humans and find that large pre-trained neural network models used in AI demonstrate more human-like abstract geometric processing.
arXiv Detail & Related papers (2024-02-06T17:59:46Z) - Evaluating alignment between humans and neural network representations in image-based learning tasks [5.657101730705275]
We tested how well the representations of $86$ pretrained neural network models mapped to human learning trajectories.<n>We found that while training dataset size was a core determinant of alignment with human choices, contrastive training with multi-modal data (text and imagery) was a common feature of currently publicly available models that predicted human generalisation.<n>In conclusion, pretrained neural networks can serve to extract representations for cognitive models, as they appear to capture some fundamental aspects of cognition that are transferable across tasks.
arXiv Detail & Related papers (2023-06-15T08:18:29Z) - Intrinsic Physical Concepts Discovery with Object-Centric Predictive
Models [86.25460882547581]
We introduce the PHYsical Concepts Inference NEtwork (PHYCINE), a system that infers physical concepts in different abstract levels without supervision.
We show that object representations containing the discovered physical concepts variables could help achieve better performance in causal reasoning tasks.
arXiv Detail & Related papers (2023-03-03T11:52:21Z) - 3D Shape Perception Integrates Intuitive Physics and
Analysis-by-Synthesis [39.933479524063976]
We propose a framework for 3D shape perception that explains perception in both typical and atypical cases.
Our results suggest that bottom-up deep neural network models are not fully adequate accounts of human shape perception.
arXiv Detail & Related papers (2023-01-09T23:11:41Z) - Human alignment of neural network representations [28.32452075196472]
We investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses.<n>We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses.<n>We find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not.
arXiv Detail & Related papers (2022-11-02T15:23:16Z) - PTR: A Benchmark for Part-based Conceptual, Relational, and Physical
Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z) - LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human
Bodies [78.17425779503047]
We propose a novel neural implicit representation for the human body.
It is fully differentiable and optimizable with disentangled shape and pose latent spaces.
Our model can be trained and fine-tuned directly on non-watertight raw data with well-designed losses.
arXiv Detail & Related papers (2021-11-30T04:10:57Z) - Physion: Evaluating Physical Prediction from Vision in Humans and
Machines [46.19008633309041]
We present a visual and physical prediction benchmark that precisely measures this capability.
We compare an array of algorithms on their ability to make diverse physical predictions.
We find that graph neural networks with access to the physical state best capture human behavior.
arXiv Detail & Related papers (2021-06-15T16:13:39Z) - Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.