Representation Potentials of Foundation Models for Multimodal Alignment: A Survey
- URL: http://arxiv.org/abs/2510.05184v1
- Date: Sun, 05 Oct 2025 21:48:51 GMT
- Title: Representation Potentials of Foundation Models for Multimodal Alignment: A Survey
- Authors: Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, Yun Fu
- Abstract summary: Foundation models learn highly transferable representations through large-scale pretraining on diverse data. We investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information.
- Score: 39.88306901879684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models learn highly transferable representations through large-scale pretraining on diverse data. An increasing body of research indicates that these representations exhibit a remarkable degree of similarity across architectures and modalities. In this survey, we investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information within a single modality while also providing a transferable basis for alignment and unification across modalities. We begin by reviewing representative foundation models and the key metrics that make alignment measurable. We then synthesize empirical evidence of representation potentials from studies in vision, language, speech, multimodality, and neuroscience. The evidence suggests that foundation models often exhibit structural regularities and semantic consistencies in their representation spaces, positioning them as strong candidates for cross-modal transfer and alignment. We further analyze the key factors that foster representation potentials, discuss open questions, and highlight potential challenges.
Related papers
- Universally Converging Representations of Matter Across Scientific Foundation Models [5.309886698585678]
We show that representations learned by nearly sixty scientific models are highly aligned across a wide range of chemical systems. On inputs similar to those seen during training, high-performing models align closely and weak models diverge into local sub-optima in representation space. Our findings establish representational alignment as a quantitative benchmark for foundation-level generality in scientific models.
(2025-12-03T12:47:06Z)
- Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges [54.669838624278924]
Foundation models have transformed natural language processing and computer vision. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective.
(2025-10-27T03:40:00Z)
- Vision Generalist Model: A Survey [87.49797517847132]
We provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. We take a brief excursion into related domains, shedding light on their interconnections and potential synergies.
(2025-06-11T17:23:41Z)
- Multi-Modal Foundation Models for Computational Pathology: A Survey [32.25958653387204]
Foundation models have emerged as a powerful paradigm in computational pathology (CPath). We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs.
(2025-03-12T06:03:33Z)
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research. Multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems.
(2025-02-22T20:55:26Z)
- Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models [24.579822095003685]
We conduct an empirical study on representation learning for downstream Visual Question Answering (VQA). We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches. We identify a promising path to leverage the strengths of both paradigms.
(2024-07-22T12:26:08Z)
- Disentangling Representations through Multi-task Learning [0.0]
We provide experimental and theoretical results guaranteeing the emergence of disentangled representations in agents that optimally solve classification tasks. We experimentally validate these predictions in RNNs trained to multi-task, which learn disentangled representations in the form of continuous attractors. We find that transformers are particularly suited for disentangling representations, which might explain their unique world understanding abilities.
(2024-07-15T21:32:58Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
(2023-07-25T17:59:18Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods.
(2022-04-26T02:22:28Z)
- Cross-Modal Discrete Representation Learning [73.68393416984618]
We present a self-supervised learning framework that learns a representation that captures finer levels of granularity across different modalities.
Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities.
(2021-06-10T00:23:33Z)