Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations
- URL: http://arxiv.org/abs/2412.09115v2
- Date: Mon, 17 Feb 2025 17:50:21 GMT
- Title: Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations
- Authors: Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo
- Abstract summary: Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization.
Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents?
We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories.
- Score: 44.51229445138653
- License:
- Abstract: Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring -- despite much prior evidence -- its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different -- if at all -- are representations learned from spatial latent estimation compared to categorization? To ask these questions, we leveraged synthetic image datasets generated by a 3D graphic engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar -- but not identical -- internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures of comparing models to brains to better understand the functional roles of the ventral stream.
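As a rough illustration of the training setup the abstract describes, the sketch below trains a CNN to regress a few spatial latents instead of predicting category labels. The specific latent names, the ResNet-18 backbone, and the random stand-in batch are illustrative assumptions, not the paper's exact configuration; in the paper, images and ground-truth latents come from a 3D graphics engine.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Hypothetical set of spatial latents; the paper's exact latent set may differ.
SPATIAL_LATENTS = ["x_position", "y_position", "distance", "rotation"]

class SpatialLatentCNN(nn.Module):
    """CNN backbone with a small regression head that predicts spatial latents."""
    def __init__(self, num_latents: int = len(SPATIAL_LATENTS)):
        super().__init__()
        backbone = resnet18(weights=None)        # backbone choice is illustrative
        backbone.fc = nn.Identity()              # keep the 512-d feature vector
        self.backbone = backbone
        self.head = nn.Linear(512, num_latents)  # one scalar output per spatial latent

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(images))

def train_step(model, optimizer, images, latents):
    """One optimization step: regress the ground-truth spatial latents with MSE."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(images), latents)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = SpatialLatentCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Stand-in batch; in the paper the images are rendered synthetically
    # with ground-truth spatial latents recorded at render time.
    images = torch.randn(8, 3, 224, 224)
    latents = torch.randn(8, len(SPATIAL_LATENTS))
    print(train_step(model, optimizer, images, latents))
```

Swapping the regression head for a classification head over hundreds of categories yields the category-trained counterpart, and the two models' internal representations can then be compared layer by layer, as the abstract describes.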
Related papers
- Linking Robustness and Generalization: A k* Distribution Analysis of Concept Clustering in Latent Space for Vision Models [56.89974470863207]
This article uses the k* Distribution, a local neighborhood analysis method, to examine the learned latent space at the level of individual concepts.
We introduce skewness-based true and approximate metrics for interpreting individual concepts to assess the overall quality of vision models' latent space.
arXiv Detail & Related papers (2024-08-17T01:43:51Z)
- Tracing Representation Progression: Analyzing and Enhancing Layer-Wise Similarity [20.17288970927518]
We study the similarity of representations between the hidden layers of individual transformers.
We show that representations across layers are positively correlated, with similarity increasing as layers get closer.
We propose an aligned training method to improve the effectiveness of shallow layers.
arXiv Detail & Related papers (2024-06-20T16:41:09Z)
- Latent Communication in Artificial Neural Networks [2.5947832846531886]
This dissertation focuses on the universality and reusability of neural representations.
A salient observation from our research is the emergence of similarities in latent representations.
arXiv Detail & Related papers (2024-06-16T17:13:58Z)
- Non-Local Latent Relation Distillation for Self-Adaptive 3D Human Pose Estimation [63.199549837604444]
3D human pose estimation approaches leverage different forms of strong (2D/3D pose) or weak (multi-view or depth) paired supervision.
We cast 3D pose learning as a self-supervised adaptation problem that aims to transfer the task knowledge from a labeled source domain to a completely unpaired target.
We evaluate different self-adaptation settings and demonstrate state-of-the-art 3D human pose estimation performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-05T03:52:57Z)
- On the Viability of Monocular Depth Pre-training for Semantic Segmentation [48.29060171161375]
We study whether pre-training on geometric tasks is viable for downstream transfer to semantic tasks.
We find that monocular depth is a viable form of pre-training for semantic segmentation, validated by improvements over common baselines.
arXiv Detail & Related papers (2022-03-26T04:27:28Z)
- Improving Neural Predictivity in the Visual Cortex with Gated Recurrent Connections [0.0]
We aim to shift the focus to architectures that take into account lateral recurrent connections, a ubiquitous feature of the ventral visual stream, to devise adaptive receptive fields.
In order to increase the robustness of our approach and the biological fidelity of the activations, we employ specific data augmentation techniques.
arXiv Detail & Related papers (2022-03-22T17:27:22Z)
- Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking [98.91894395941766]
We propose a novel online approach to learning the pose dynamics, which are independent of the pose detections in the current frame.
Specifically, we derive this prediction of dynamics through a graph neural network (GNN) that explicitly accounts for both spatial-temporal and visual information.
Experiments on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
arXiv Detail & Related papers (2021-06-07T16:36:50Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- Do Saliency Models Detect Odd-One-Out Targets? New Datasets and Evaluations [15.374430656911498]
We investigate singleton detection, which can be thought of as a canonical example of salience.
We show that nearly all saliency algorithms do not adequately respond to singleton targets in synthetic and natural images.
arXiv Detail & Related papers (2020-05-13T20:59:53Z)
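To make "singleton detection" concrete, the sketch below generates a simple synthetic odd-one-out stimulus: a grid of identically oriented bars with a single differently oriented target. The grid size, bar geometry, and orientations are illustrative assumptions, not the stimuli or datasets used in that paper.

```python
import numpy as np

def draw_bar(canvas, cy, cx, angle_deg, length=9, half_width=1):
    """Rasterize a short oriented bar centered at (cy, cx)."""
    theta = np.deg2rad(angle_deg)
    for t in np.linspace(-length / 2, length / 2, 4 * length):
        y = int(round(cy + t * np.sin(theta)))
        x = int(round(cx + t * np.cos(theta)))
        canvas[max(y - half_width, 0):y + half_width + 1,
               max(x - half_width, 0):x + half_width + 1] = 1.0

def singleton_stimulus(grid=6, cell=24, distractor_deg=45, target_deg=135, seed=0):
    """Grid of identically oriented bars with one odd-one-out bar (the singleton)."""
    rng = np.random.default_rng(seed)
    img = np.zeros((grid * cell, grid * cell), dtype=np.float32)
    target_idx = int(rng.integers(grid * grid))
    for i in range(grid * grid):
        cy = (i // grid) * cell + cell // 2
        cx = (i % grid) * cell + cell // 2
        angle = target_deg if i == target_idx else distractor_deg
        draw_bar(img, cy, cx, angle)
    return img, divmod(target_idx, grid)  # image and the singleton's (row, col)

img, target_pos = singleton_stimulus()
print(img.shape, target_pos)  # a saliency model should peak near target_pos
```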