Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment
- URL: http://arxiv.org/abs/2508.09850v1
- Date: Wed, 13 Aug 2025 14:29:12 GMT
- Authors: Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Valero Laparra, Jesus Malo
- Abstract summary: Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.
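One common way to quantify the kind of perceptual alignment described in the abstract is to rank-correlate distances in a model's feature space with human mean opinion scores (MOS) for the same reference/distorted image pairs. The sketch below illustrates that idea only. The Euclidean distance, the Spearman coefficient, the function names, and the toy numbers are all assumptions made for illustration and are not details confirmed by the abstract.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def feature_distance(ref_features, dist_features):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref_features, dist_features)))


def perceptual_alignment(ref_feats, dist_feats, mos):
    """Rank-correlate model distances with human scores for each image pair."""
    distances = [feature_distance(r, d) for r, d in zip(ref_feats, dist_feats)]
    return spearman(distances, mos)


# Hypothetical toy data: four reference/distorted pairs with made-up features.
# MOS is assumed to follow the convention that higher means better quality.
ref_feats = [[0.0, 0.0]] * 4
dist_feats = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
mos = [8.0, 6.5, 4.0, 2.0]

rho = perceptual_alignment(ref_feats, dist_feats, mos)
print(f"Spearman correlation between model distance and MOS: {rho:.3f}")
```

Under the assumed MOS convention, a strongly negative correlation means that larger model distances go with lower human quality ratings, so the magnitude of the coefficient is what would indicate alignment. In a real evaluation the feature vectors would come from a ViT applied to the reference and distorted images.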
Related papers
- LVLM-Aided Alignment of Task-Specific Vision Models [49.96265491629163]
Small task-specific vision models are crucial in high-stakes domains.
We introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge.
Our method demonstrates substantial improvement in aligning model behavior with human specifications.
arXiv Detail & Related papers (2025-12-26T11:11:25Z)
- Contour Integration Underlies Human-Like Vision [2.6716072974490794]
Humans perform at high accuracy, even with few object contours present.
Humans exhibit an integration bias -- a preference towards recognizing objects made up of directional fragments over directionless fragments.
arXiv Detail & Related papers (2025-04-07T16:45:06Z)
- Deep Domain Adaptation: A Sim2Real Neural Approach for Improving Eye-Tracking Systems [80.62854148838359]
Eye image segmentation is a critical step in eye tracking that has great influence over the final gaze estimate.
We use dimensionality-reduction techniques to measure the overlap between the target eye images and synthetic training data.
Our methods result in robust, improved performance when tackling the discrepancy between simulation and real-world data samples.
arXiv Detail & Related papers (2024-03-23T22:32:06Z) - A data-centric approach to class-specific bias in image data
augmentation [0.0]
Data augmentation (DA) enhances model generalization in computer vision but may introduce biases, impacting class accuracy unevenly.
We evaluate DA's class-specific bias across various datasets, including those distinct from ImageNet, through random cropping.
This suggests a nuanced approach to model selection, emphasizing bias mitigation.
arXiv Detail & Related papers (2024-03-07T00:32:47Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Gaze-Informed Vision Transformers: Predicting Driving Decisions Under Uncertainty [5.006068984003071]
Vision Transformers (ViT) have advanced computer vision, yet their efficacy in complex tasks like driving remains less explored.<n>This study enhances ViT by integrating human eye gaze, captured via eye-tracking, to increase prediction accuracy in driving scenarios under uncertainty.
arXiv Detail & Related papers (2023-08-26T22:48:06Z) - A Multidimensional Analysis of Social Biases in Vision Transformers [15.98510071115958]
We measure the impact of training data, model architecture, and training objectives on social biases in Vision Transformers (ViTs)
Our findings indicate that counterfactual augmentation training using diffusion-based image editing can mitigate biases, but does not eliminate them.
We find that larger models are less biased than smaller models, and that models trained using discriminative objectives are less biased than those trained using generative objectives.
arXiv Detail & Related papers (2023-08-03T09:03:40Z) - StyleGAN-Human: A Data-Centric Odyssey of Human Generation [96.7080874757475]
This work takes a data-centric perspective and investigates multiple critical aspects in "data engineering"
We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures.
We rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
arXiv Detail & Related papers (2022-04-25T17:55:08Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance with far fewer trainable parameters, and trains and runs inference at high speed.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - Stereopagnosia: Fooling Stereo Networks with Adversarial Perturbations [71.00754846434744]
We show that imperceptible additive perturbations can significantly alter the disparity map.
We show that, when used for adversarial data augmentation, our perturbations result in trained models that are more robust.
arXiv Detail & Related papers (2020-09-21T19:20:09Z)