Human-Like Coarse Object Representations in Vision Models
- URL: http://arxiv.org/abs/2602.12486v1
- Date: Thu, 12 Feb 2026 23:59:58 GMT
- Title: Human-Like Coarse Object Representations in Vision Models
- Authors: Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman,
- Abstract summary: Humans represent objects for intuitive physics with coarse, volumetric "bodies" whose internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We find that alignment with human behavior follows an inverse U-shaped curve.
- Score: 7.548979981481746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.
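The abstract's pipeline - compare human and model time-to-collision behavior while turning a capacity knob, then look for the inverse-U peak - can be sketched as follows. This is a minimal illustration, not the authors' code: the per-scene error values are invented, and Pearson correlation of error profiles stands in for the paper's alignment metric.

```python
# Minimal sketch of the comparison pipeline described in the abstract.
# NOT the authors' code: the per-scene time-to-collision (TTC) errors
# below are invented, and Pearson correlation of human vs. model error
# profiles is a stand-in for the paper's alignment metric.

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical mean human TTC error on five scenes.
human_err = [0.9, 0.4, 1.2, 0.7, 1.0]

# Hypothetical model errors at three effective capacities (e.g. via pruning).
model_err = {
    "small":  [1.5, 1.4, 1.3, 1.5, 1.4],  # under-segments into blobs
    "medium": [0.8, 0.5, 1.1, 0.7, 0.9],  # "ideal body granularity"
    "large":  [1.0, 0.6, 0.9, 0.9, 0.8],  # over-segments, boundary wiggles
}

alignment = {cap: pearson(human_err, errs) for cap, errs in model_err.items()}
best = max(alignment, key=alignment.get)  # inverse-U peak -> "medium"
```

With these made-up numbers the intermediate-capacity model tracks the human error profile most closely, mirroring the inverse-U result the paper reports.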
Related papers
- Human-level 3D shape perception emerges from multi-view learning [63.048728487674815]
We develop a modeling framework that predicts human 3D shape inferences for arbitrary objects. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data. We find that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data.
arXiv Detail & Related papers (2026-02-19T18:56:05Z) - Towards aligned body representations in vision models [7.548979981481746]
We test whether vision models trained for segmentation develop comparable representations. We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grain encodings.
arXiv Detail & Related papers (2025-11-29T07:25:32Z) - Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning [50.76723760768117]
Existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. We find that human appearance can provide a straightforward cue to address these obstacles. We propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws.
arXiv Detail & Related papers (2025-07-03T12:19:26Z) - Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment [0.14999444543328289]
We employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations. We find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. Our results offer new insights into the role of linguistic information in acquiring precise object representations.
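For intuition, the core idea of unsupervised Gromov-Wasserstein alignment - matching two representation spaces using only their internal distance structure, with no shared labels - can be shown with a toy brute-force variant. Real GW solvers optimize soft couplings; here, for a tiny set, we simply search over permutation couplings. The distance matrices are made up.

```python
# Toy illustration of unsupervised alignment in the Gromov-Wasserstein
# spirit: match items of two representation spaces using ONLY their
# internal pairwise distances. Real GW optimizes a soft coupling; for a
# tiny set we can brute-force over permutation couplings instead.
from itertools import permutations

# Made-up pairwise distance matrices for three objects in two spaces;
# C2 is C1 with its items relabeled by a hidden permutation.
C1 = [[0, 1, 2],
      [1, 0, 3],
      [2, 3, 0]]
C2 = [[0, 2, 3],
      [2, 0, 1],
      [3, 1, 0]]

def gw_cost(perm):
    """Squared distortion of mapping item i of space 1 to perm[i] of space 2."""
    n = len(C1)
    return sum((C1[i][j] - C2[perm[i]][perm[j]]) ** 2
               for i in range(n) for j in range(n))

# The minimizer recovers the hidden relabeling with zero distortion.
best_perm = min(permutations(range(len(C1))), key=gw_cost)
```

Because the two spaces have identical internal geometry up to relabeling, the zero-cost permutation is the correct correspondence, found without any supervised pairing.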
arXiv Detail & Related papers (2025-05-22T09:06:06Z) - Contour Integration Underlies Human-Like Vision [2.6716072974490794]
Humans perform at high accuracy, even with few object contours present. Humans exhibit an integration bias -- a preference towards recognizing objects made up of directional fragments over directionless fragments.
arXiv Detail & Related papers (2025-04-07T16:45:06Z) - Learning Visibility for Robust Dense Human Body Estimation [78.37389398573882]
Estimating 3D human pose and shape from 2D images is a crucial yet challenging task.
We learn a dense human body estimation model that is robust to partial observations.
We obtain pseudo ground-truths of visibility labels from dense UV correspondences and train a neural network to predict visibility along with 3D coordinates.
arXiv Detail & Related papers (2022-08-23T00:01:05Z) - COAP: Compositional Articulated Occupancy of People [28.234772596912162]
We present a novel neural implicit representation for articulated human bodies.
We employ a part-aware encoder-decoder architecture to learn neural articulated occupancy.
Our method largely outperforms existing solutions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-04-13T06:02:20Z) - LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies [78.17425779503047]
We propose a novel neural implicit representation for the human body.
It is fully differentiable and optimizable with disentangled shape and pose latent spaces.
Our model can be trained and fine-tuned directly on non-watertight raw data with well-designed losses.
arXiv Detail & Related papers (2021-11-30T04:10:57Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Deep Physics-aware Inference of Cloth Deformation for Monocular Human Performance Capture [84.73946704272113]
We show how integrating physics into the training process improves the learned cloth deformations and allows modeling clothing as a separate piece of geometry.
Our approach leads to a significant improvement over current state-of-the-art methods and is thus a clear step towards realistic monocular capture of the entire deforming surface of a clothed human.
arXiv Detail & Related papers (2020-11-25T16:46:00Z) - Monocular Human Pose and Shape Reconstruction using Part Differentiable Rendering [53.16864661460889]
Recent regression-based methods succeed in estimating parametric models directly through a deep neural network supervised by 3D ground truth.
In this paper, we introduce body segmentation as critical supervision.
To improve the reconstruction, we propose a part-level differentiable renderer that enables part-based models to be supervised by part segmentation.
arXiv Detail & Related papers (2020-03-24T14:25:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.