Aligning Machine and Human Visual Representations across Abstraction Levels
- URL: http://arxiv.org/abs/2409.06509v3
- Date: Tue, 29 Oct 2024 09:30:12 GMT
- Title: Aligning Machine and Human Visual Representations across Abstraction Levels
- Authors: Lukas Muttenthaler, Klaus Greff, Frieda Born, Bernhard Spitzer, Simon Kornblith, Michael C. Mozer, Klaus-Robert Müller, Thomas Unterthiner, Andrew K. Lampinen,
- Abstract summary: Deep neural networks have achieved success across a wide range of applications, including as models of human behavior in vision tasks.
However, neural network training and human learning differ in fundamental ways, and neural networks often fail to generalize as robustly as humans do.
We highlight a key misalignment between vision models and humans: whereas human conceptual knowledge is hierarchically organized from fine- to coarse-scale distinctions, model representations do not accurately capture all these levels of abstraction.
To address this misalignment, we first train a teacher model to imitate human judgments, then transfer human-like structure from its representations into pretrained state-of-the
- Score: 42.86478924838503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks have achieved success across a wide range of applications, including as models of human behavior in vision tasks. However, neural network training and human learning differ in fundamental ways, and neural networks often fail to generalize as robustly as humans do, raising questions regarding the similarity of their underlying representations. What is missing for modern learning systems to exhibit more human-like behavior? We highlight a key misalignment between vision models and humans: whereas human conceptual knowledge is hierarchically organized from fine- to coarse-scale distinctions, model representations do not accurately capture all these levels of abstraction. To address this misalignment, we first train a teacher model to imitate human judgments, then transfer human-like structure from its representations into pretrained state-of-the-art vision foundation models. These human-aligned models more accurately approximate human behavior and uncertainty across a wide range of similarity tasks, including a new dataset of human judgments spanning multiple levels of semantic abstractions. They also perform better on a diverse set of machine learning tasks, increasing generalization and out-of-distribution robustness. Thus, infusing neural networks with additional human knowledge yields a best-of-both-worlds representation that is both more consistent with human cognition and more practically useful, thus paving the way toward more robust, interpretable, and human-like artificial intelligence systems.
Related papers
- Achieving More Human Brain-Like Vision via Human EEG Representational Alignment [1.811217832697894]
We present 'Re(presentational)Al(ignment)net', a vision model aligned with human brain activity based on non-invasive EEG.
Our innovative image-to-brain multi-layer encoding framework advances human neural alignment by optimizing multiple model layers.
Our findings suggest that ReAlnet represents a breakthrough in bridging the gap between artificial and human vision, and paving the way for more brain-like artificial intelligence systems.
arXiv Detail & Related papers (2024-01-30T18:18:41Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - Evaluating alignment between humans and neural network representations in image-based learning tasks [5.657101730705275]
We tested how well the representations of $86$ pretrained neural network models mapped to human learning trajectories.
We found that while training dataset size was a core determinant of alignment with human choices, contrastive training with multi-modal data (text and imagery) was a common feature of currently publicly available models that predicted human generalisation.
In conclusion, pretrained neural networks can serve to extract representations for cognitive models, as they appear to capture some fundamental aspects of cognition that are transferable across tasks.
arXiv Detail & Related papers (2023-06-15T08:18:29Z) - A Theory of Human-Like Few-Shot Learning [14.271690184738205]
We derive a theory of human-like few-shot learning from von-Neuman-Landauer's principle.
We find that deep generative model like variational autoencoder (VAE) can be used to approximate our theory.
arXiv Detail & Related papers (2023-01-03T11:22:37Z) - Human alignment of neural network representations [22.671101285994013]
We investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses.
We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses.
We find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not.
arXiv Detail & Related papers (2022-11-02T15:23:16Z) - WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z) - Cognitive architecture aided by working-memory for self-supervised
multi-modal humans recognition [54.749127627191655]
The ability to recognize human partners is an important social skill to build personalized and long-term human-robot interactions.
Deep learning networks have achieved state-of-the-art results and demonstrated to be suitable tools to address such a task.
One solution is to make robots learn from their first-hand sensory data with self-supervision.
arXiv Detail & Related papers (2021-03-16T13:50:24Z) - Human-Understandable Decision Making for Visual Recognition [30.30163407674527]
We propose a new framework to train a deep neural network by incorporating the prior of human perception into the model learning process.
The effectiveness of our proposed model is evaluated on two classical visual recognition tasks.
arXiv Detail & Related papers (2021-03-05T02:07:33Z) - Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and
Reasoning [78.13740873213223]
Bongard problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems.
We propose a new benchmark Bongard-LOGO for human-level concept learning and reasoning.
arXiv Detail & Related papers (2020-10-02T03:19:46Z) - Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI)
This article deals with the aspects of modeling commonsense reasoning focusing on such domain as interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.