Alignment among Language, Vision and Action Representations
- URL: http://arxiv.org/abs/2601.22948v1
- Date: Fri, 30 Jan 2026 13:12:07 GMT
- Title: Alignment among Language, Vision and Action Representations
- Authors: Nicola Milano, Stefano Nolfi
- Abstract summary: We show that linguistic, visual, and action representations converge toward partially shared semantic structures.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental question in cognitive science and AI concerns whether different learning modalities (language, vision, and action) give rise to distinct or shared internal representations. Traditional views assume that models trained on different data types develop specialized, non-transferable representations. However, recent evidence suggests unexpected convergence: models optimized for distinct tasks may develop similar representational geometries. We investigate whether this convergence extends to embodied action learning by training a transformer-based agent to execute goal-directed behaviors in response to natural language instructions. Using behavioral cloning on the BabyAI platform, we generated action-grounded language embeddings shaped exclusively by sensorimotor control requirements. We then compared these representations with those extracted from state-of-the-art large language models (LLaMA, Qwen, DeepSeek, BERT) and vision-language models (CLIP, BLIP). Despite substantial differences in training data, modality, and objectives, we observed robust cross-modal alignment. Action representations aligned strongly with decoder-only language models and BLIP (precision@15: 0.70-0.73), approaching the alignment observed among language models themselves. Alignment with CLIP and BERT was significantly weaker. These findings indicate that linguistic, visual, and action representations converge toward partially shared semantic structures, supporting modality-independent semantic organization and highlighting potential for cross-domain transfer in embodied AI systems.
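The precision@15 score reported in the abstract is a neighborhood-overlap measure: for each item, it compares the item's nearest neighbors in one embedding space with its nearest neighbors in another. A minimal sketch of this kind of metric (the paper's exact formulation may differ; the function name and cosine-similarity choice here are illustrative assumptions):

```python
import numpy as np

def precision_at_k(emb_a, emb_b, k=15):
    """Average fraction of shared k-nearest neighbours between two
    embedding spaces over the same items (rows must correspond)."""
    def knn_indices(emb):
        # Cosine similarity between all row pairs.
        normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # exclude self-matches
        # Indices of the k most similar items per row.
        return np.argsort(-sim, axis=1)[:, :k]

    nn_a, nn_b = knn_indices(emb_a), knn_indices(emb_b)
    overlaps = [
        len(set(row_a) & set(row_b)) / k
        for row_a, row_b in zip(nn_a, nn_b)
    ]
    return float(np.mean(overlaps))

# Identical spaces align perfectly; unrelated random spaces should not.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 32))
print(precision_at_k(x, x.copy(), k=15))  # 1.0
```

Under this reading, the reported 0.70-0.73 means roughly 10-11 of each item's 15 nearest neighbors are shared between the action space and the language-model space.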
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments.
We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision.
We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z)
- VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set [80.50996301430108]
The alignment of vision-language representations endows current Vision-Language Models with strong multi-modal reasoning capabilities.
We propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations.
For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts.
arXiv Detail & Related papers (2025-10-24T10:29:31Z)
- Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning [29.745218855471787]
Tokenization is a necessary component within the current architecture of many language models.
We argue that tokenization is necessary for reasonably human-like language performance.
We discuss implications for architectural choices, meaning construction, and the primacy of language for thought.
arXiv Detail & Related papers (2024-12-14T18:18:52Z)
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
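The discriminative paradigm mentioned above, exemplified by CLIP, scores image-text pairs by similarity in a shared embedding space; zero-shot classification then reduces to nearest-label retrieval. A toy sketch of that recipe (the embeddings and labels here are illustrative stand-ins for real encoder outputs, not CLIP's actual API):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most cosine-similar
    to the image embedding (the CLIP-style retrieval recipe)."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = norm(text_embs) @ norm(image_emb)
    return labels[int(np.argmax(sims))]

# Toy 2-D embeddings standing in for real encoder outputs.
labels = ["cat", "dog"]
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([0.9, 0.1])  # closer to the "cat" direction
print(zero_shot_classify(image_emb, text_embs, labels))  # cat
```

In real CLIP-style models the two encoders are trained contrastively so that matching image-text pairs land close together under exactly this cosine comparison.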
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models [7.511284868070148]
We investigate whether integration of visuo-linguistic information leads to representations that are more aligned with human brain activity.
Our findings indicate an advantage of multimodal models in predicting human brain activations.
arXiv Detail & Related papers (2024-07-25T10:08:37Z)
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Constructing Word-Context-Coupled Space Aligned with Associative Knowledge Relations for Interpretable Language Modeling [0.0]
The black-box structure of the deep neural network in pre-trained language models seriously limits the interpretability of the language modeling process.
A Word-Context-Coupled Space (W2CSpace) is proposed by introducing the alignment processing between uninterpretable neural representation and interpretable statistical logic.
Our language model can achieve better performance and highly credible interpretable ability compared to related state-of-the-art methods.
arXiv Detail & Related papers (2023-05-19T09:26:02Z)
- Learning Semantic Textual Similarity via Topic-informed Discrete Latent Variables [17.57873577962635]
We develop a topic-informed discrete latent variable model for semantic textual similarity.
Our model learns a shared latent space for sentence-pair representation via vector quantization.
We show that our model is able to surpass several strong neural baselines in semantic textual similarity tasks.
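The vector-quantization step this entry refers to maps each continuous representation to its nearest entry in a learned codebook, yielding discrete latent codes. A minimal sketch of that lookup (a fixed toy codebook here; in practice the codebook is trained jointly with the encoder):

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each continuous vector to its nearest codebook entry
    (the vector-quantization step behind discrete latent variables)."""
    # Squared Euclidean distance from every vector to every code.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1)      # discrete latent ids
    return codes, codebook[codes]     # ids and quantized vectors

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
vectors = np.array([[0.1, -0.2], [0.9, 1.2]])
codes, quantized = quantize(vectors, codebook)
print(codes)  # [0 1]
```

Because the output codes are discrete, sentence pairs that quantize to the same codebook entries share a latent representation, which is what makes the space usable for similarity comparison.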
arXiv Detail & Related papers (2022-11-07T15:09:58Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- The Grammar-Learning Trajectories of Neural Language Models [42.32479280480742]
We show that neural language models acquire linguistic phenomena in a similar order, despite having different end performances over the data.
Results suggest that NLMs exhibit consistent "developmental" stages.
arXiv Detail & Related papers (2021-09-13T16:17:23Z)
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.