Representations in vision and language converge in a shared, multidimensional space of perceived similarities
- URL: http://arxiv.org/abs/2507.21871v1
- Date: Tue, 29 Jul 2025 14:42:31 GMT
- Title: Representations in vision and language converge in a shared, multidimensional space of perceived similarities
- Authors: Katerina Marie Simkova, Adrien Doerig, Clayton Hickey, Ian Charest
- Abstract summary: We show that visual and linguistic similarity judgements converge at the behavioural level. These judgements also predict a remarkably similar network of fMRI brain responses evoked by viewing the natural scene images. These findings demonstrate that human visual and linguistic similarity judgements are grounded in a shared, modality-agnostic representational structure.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can effortlessly describe what they see, yet establishing a shared representational format between vision and language remains a significant challenge. Emerging evidence suggests that human brain representations in both vision and language are well predicted by semantic feature spaces obtained from large language models (LLMs). This raises the possibility that sensory systems converge in their inherent ability to transform their inputs into a shared, embedding-like representational space. However, it remains unclear how such a space manifests in human behaviour. To investigate this, sixty-three participants performed behavioural similarity judgements separately on 100 natural scene images and 100 corresponding sentence captions from the Natural Scenes Dataset. We found that visual and linguistic similarity judgements not only converge at the behavioural level but also predict a remarkably similar network of fMRI brain responses evoked by viewing the natural scene images. Furthermore, computational models trained to map images onto LLM embeddings outperformed both category-trained and AlexNet controls in explaining the behavioural similarity structure. These findings demonstrate that human visual and linguistic similarity judgements are grounded in a shared, modality-agnostic representational structure that mirrors how the visual system encodes experience. The convergence between sensory and artificial systems suggests a common capacity for forming conceptual representations: not as arbitrary products of first-order, modality-specific input, but as structured representations that reflect the stable, relational properties of the external world.
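The core comparison described in the abstract can be illustrated with a representational similarity analysis (RSA). The sketch below is a minimal, hypothetical example, not the authors' pipeline: it assumes two 100 x 100 behavioural dissimilarity matrices (one from image judgements, one from caption judgements, here random placeholders) and correlates their unique off-diagonal entries with Spearman's rho.
```python
# Minimal RSA sketch (illustrative; not the authors' code). Random
# matrices stand in for behavioural dissimilarity matrices derived
# from image and caption similarity judgements.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def placeholder_rdm(n=100):
    """Symmetric dissimilarity matrix with a zero diagonal."""
    d = rng.random((n, n))
    d = (d + d.T) / 2
    np.fill_diagonal(d, 0.0)
    return d

vision_rdm = placeholder_rdm()    # stand-in for image judgements
language_rdm = placeholder_rdm()  # stand-in for caption judgements

# Correlate only the unique stimulus pairs (condensed form).
rho, p = spearmanr(squareform(vision_rdm, checks=False),
                   squareform(language_rdm, checks=False))
print(f"vision-language RDM correlation: rho={rho:.3f} (p={p:.3g})")
```
With real data, the same condensed vectors could also be correlated against RDMs computed from fMRI responses to localise where the shared structure is reflected in the brain.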
Related papers
- Convergent transformations of visual representation in brains and models [0.0]
A fundamental question in cognitive neuroscience is what shapes visual perception: the external world's structure or the brain's internal architecture. We show a convergent computational solution for visual encoding in both human and artificial vision, driven by the structure of the external world.
arXiv Detail & Related papers (2025-07-18T14:13:54Z)
- Human-like conceptual representations emerge from language prediction [72.5875173689788]
Large language models (LLMs) trained exclusively through next-token prediction over language data exhibit remarkably human-like behaviors. Are these models developing concepts akin to humans, and if so, how are such concepts represented and organized? Our results demonstrate that LLMs can flexibly derive concepts from linguistic descriptions in relation to contextual cues about other concepts. These findings establish that structured, human-like conceptual representations can naturally emerge from language prediction without real-world grounding.
arXiv Detail & Related papers (2025-01-21T23:54:17Z)
- Analyzing The Language of Visual Tokens [48.62180485759458]
We take a natural-language-centric approach to analyzing discrete visual languages.
We show that higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts.
We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages.
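As a rough illustration of the statistics such an analysis rests on, the sketch below computes the empirical Shannon entropy of a token sequence and a zlib-based compression ratio; the sequences are synthetic stand-ins, not actual visual tokens, and the vocabulary sizes are arbitrary.
```python
# Illustrative token statistics (synthetic sequences, not visual tokens).
import zlib
from collections import Counter
import numpy as np

def shannon_entropy(tokens):
    """Empirical entropy in bits per token."""
    counts = np.array(list(Counter(tokens).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def compression_ratio(tokens):
    """Compressed size over raw size; lower means more compressible."""
    raw = bytes(int(t) % 256 for t in tokens)
    return len(zlib.compress(raw)) / len(raw)

rng = np.random.default_rng(0)
low_innovation = rng.integers(0, 16, size=10_000)     # small effective vocabulary
high_innovation = rng.integers(0, 4096, size=10_000)  # large effective vocabulary

for name, seq in [("low innovation", low_innovation),
                  ("high innovation", high_innovation)]:
    print(f"{name}: {shannon_entropy(seq):.2f} bits/token, "
          f"compression ratio {compression_ratio(seq):.2f}")
```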
arXiv Detail & Related papers (2024-11-07T18:59:28Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
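One common way to inject such an inductive bias, shown here purely as an illustration rather than the paper's exact setup, is to fine-tune a backbone with a triplet objective on human similarity judgements: the loss pushes the anchor-positive similarity above the anchor-negative similarity.
```python
# Sketch of a perceptual-alignment objective (illustrative; the paper's
# training setup may differ). Human triplets of the form "anchor is more
# similar to positive than to negative" supervise the loss.
import torch
import torch.nn.functional as F

def perceptual_triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss on cosine similarities derived from human triplets."""
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin + sim_neg - sim_pos).mean()

# Random features standing in for a vision backbone's embeddings.
a, p, n = (torch.randn(32, 512) for _ in range(3))
print(f"alignment loss: {perceptual_triplet_loss(a, p, n).item():.3f}")
```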
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Semantic Brain Decoding: from fMRI to conceptually similar image reconstruction of visual stimuli [0.29005223064604074]
We propose a novel approach to brain decoding that also relies on semantic and contextual similarity.
We employ an fMRI dataset of natural image vision and create a deep learning decoding pipeline inspired by the existence of both bottom-up and top-down processes in human vision.
We produce reconstructions of visual stimuli that match the original content very well on a semantic level, surpassing the state of the art in previous literature.
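The decoding step can be sketched as a regression from voxel patterns into a semantic embedding space followed by nearest-neighbour retrieval. The example below is a hypothetical stand-in (random data, ridge regression) rather than the paper's pipeline.
```python
# Semantic decoding sketch (illustrative; not the paper's pipeline):
# regress fMRI voxel patterns onto semantic embeddings, then retrieve
# the conceptually nearest candidate for a held-out brain pattern.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X_train = rng.standard_normal((800, 5000))  # fMRI patterns (stimuli x voxels)
Y_train = rng.standard_normal((800, 512))   # semantic embeddings per stimulus

decoder = Ridge(alpha=1.0).fit(X_train, Y_train)

x_test = rng.standard_normal((1, 5000))     # held-out brain pattern
y_pred = decoder.predict(x_test)            # predicted semantic embedding

candidates = rng.standard_normal((100, 512))  # candidate image embeddings
print("retrieved candidate:", cosine_similarity(y_pred, candidates).argmax())
```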
arXiv Detail & Related papers (2022-12-13T16:54:08Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Words are all you need? Capturing human sensory similarity with textual descriptors [12.191617984664683]
We explore the relation between human similarity judgments and language.
We introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general.
We show that our prediction pipeline based on text descriptors exhibits excellent performance.
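A stripped-down version of this idea, using TF-IDF vectors over hypothetical tag sets in place of the paper's mined descriptors, can be written as follows; the human ratings here are random placeholders.
```python
# Similarity prediction from text descriptors (illustrative sketch with
# hypothetical tags; the paper's tag-mining pipeline is more elaborate).
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tags = [  # hypothetical descriptor sets for four stimuli
    "metallic sharp ringing bright",
    "metallic clang ringing harsh",
    "soft muffled low warm",
    "soft breathy low quiet",
]
predicted = cosine_similarity(TfidfVectorizer().fit_transform(tags))

# Random placeholder for human pairwise similarity ratings.
rng = np.random.default_rng(0)
human = rng.random((4, 4))
human = (human + human.T) / 2

iu = np.triu_indices(4, k=1)  # unique stimulus pairs
rho, _ = spearmanr(predicted[iu], human[iu])
print(f"text-based prediction vs. human ratings: rho={rho:.3f}")
```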
arXiv Detail & Related papers (2022-06-08T18:09:19Z)
- Emergence of Machine Language: Towards Symbolic Intelligence with Neural Networks [73.94290462239061]
We propose to combine symbolism and connectionism principles by using neural networks to derive a discrete representation.
By designing an interactive environment and task, we demonstrated that machines could generate a spontaneous, flexible, and semantic language.
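One standard way to let a network emit discrete, trainable symbols, used here only as an illustration of the general idea rather than the paper's architecture, is a Gumbel-softmax bottleneck:
```python
# Discrete communication bottleneck via Gumbel-softmax (a standard
# technique; the paper's interactive environment is not reproduced here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Speaker(nn.Module):
    """Maps an observation to a one-hot 'word' from a small vocabulary."""
    def __init__(self, obs_dim=16, vocab_size=8):
        super().__init__()
        self.logits = nn.Linear(obs_dim, vocab_size)

    def forward(self, obs, tau=1.0):
        # hard=True emits a discrete one-hot symbol in the forward pass
        # while gradients flow via the straight-through estimator.
        return F.gumbel_softmax(self.logits(obs), tau=tau, hard=True)

speaker = Speaker()
message = speaker(torch.randn(4, 16))  # 4 symbols over an 8-word vocabulary
print(message.argmax(dim=-1))          # the discrete 'words' emitted
```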
arXiv Detail & Related papers (2022-01-14T14:54:58Z)
- Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses [62.197912623223964]
We show a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings.
We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI.
This suggests that the embedding captures some part of the brain's natural language representation structure.
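A minimal version of such an encoding analysis, with synthetic data standing in for language-model features and fMRI responses, looks like this:
```python
# Voxelwise encoding sketch (illustrative; not the paper's exact method):
# cross-validated ridge regression from one feature space to a voxel's
# responses, scored with R^2.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300))   # feature space (stimuli x features)
w = rng.standard_normal(300)
y = X @ w + rng.standard_normal(500)  # synthetic voxel response

model = RidgeCV(alphas=np.logspace(-2, 4, 13))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean cross-validated R^2: {scores.mean():.3f}")
```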
arXiv Detail & Related papers (2021-06-09T22:59:12Z)