Related papers: Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

URL: http://arxiv.org/abs/2507.01201v5
Date: Thu, 28 Aug 2025 09:29:38 GMT
Title: Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
Authors: Lauren Hyoseo Yoon, Yisong Yue, Been Kim,
Abstract summary: We show that the Joint Autoencoder Modulator (JAM) induces alignment even across independently trained representations.<n>Our findings offer theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.
Score: 30.07172193932125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions-where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.

Related papers

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear.<n>We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z)
The Trinity of Consistency as a Defining Principle for General World Models [106.16462830681452]
General World Models are capable of learning, simulating, and reasoning about objective physical laws.<n>We propose a principled theoretical framework that defines the essential properties requisite for a General World Model.<n>Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.
arXiv Detail & Related papers (2026-02-26T16:15:55Z)
Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework.<n>We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale.<n>Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion [31.189038928192648]
Co2S is a semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models.<n>An explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries.<n>Experiments on six popular datasets demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-12-28T18:24:19Z)
UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer.<n>It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness.<n>We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
Global and Local Entailment Learning for Natural World Imagery [7.874291189886743]
Radial Cross-Modal Embeddings (RCME) is a framework that enables the explicit modeling of transitivity-enforced entailment.<n>We develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life.
arXiv Detail & Related papers (2025-06-26T17:05:06Z)
Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable.<n>We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment.<n>Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
Discrete Markov Bridge [93.64996843697278]
We propose a novel framework specifically designed for discrete representation learning, called Discrete Markov Bridge.<n>Our approach is built upon two key components: Matrix Learning and Score Learning.
arXiv Detail & Related papers (2025-05-26T09:32:12Z)
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model [39.58344147240552]
We investigate whether large vision-language models (VLMs) can compose capabilities across modalities or tasks under out-of-distribution conditions.<n>Our findings shed light on the current limitations of RL-based reasoning VLM training and provide actionable insights toward building models that reason compositionally across modalities and tasks.
arXiv Detail & Related papers (2025-05-26T01:42:38Z)
Multi-Scale Probabilistic Generation Theory: A Unified Information-Theoretic Framework for Hierarchical Structure in Large Language Models [1.0117553823134735]
Large Language Models (LLMs) exhibit remarkable emergent abilities but remain poorly understood at a mechanistic level.<n>This paper introduces the Multi-Scale Probabilistic Generation Theory (MSPGT)<n>MSPGT posits that standard language modeling objectives implicitly optimize multi-scale information compression.
arXiv Detail & Related papers (2025-05-23T16:55:35Z)
Relative Overfitting and Accept-Reject Framework [5.465098504510676]
We propose an ensemble framework that governs how models are segmented to ensure performance improvement.<n>We detail the patterns of this framework within the domain of NLP and briefly describe its to other fields, such as computer vision (CV) and AI for science.
arXiv Detail & Related papers (2025-05-12T17:36:14Z)
SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models [12.26595705520937]
We introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints.<n>Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods.
arXiv Detail & Related papers (2025-03-11T10:17:32Z)
JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling [62.77347895550087]
We introduce JADE, a generative framework that learns the variations of human shapes with fined-grained control.<n>Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures.<n>To generate coherent and plausible human shapes under our proposed decomposition, we also present a cascaded pipeline.
arXiv Detail & Related papers (2024-12-29T14:18:35Z)
Latent Functional Maps: a spectral framework for representation alignment [34.20582953800544]
We introduce a multi-purpose framework to the representation learning community, which allows to: (i) compare different spaces in an interpretable way and measure their intrinsic similarity; (ii) find correspondences between them, both in unsupervised and weakly supervised settings, and (iii) to effectively transfer representations between distinct spaces. We validate our framework on various applications, ranging from stitching to retrieval tasks, and on multiple modalities, demonstrating that Latent Functional Maps can serve as a swiss-army knife for representation alignment.
arXiv Detail & Related papers (2024-06-20T10:43:28Z)
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
Universal Information Extraction as Unified Semantic Matching [54.19974454019611]
We decouple information extraction into two abilities, structuring and conceptualizing, which are shared by different tasks and schemas. Based on this paradigm, we propose to universally model various IE tasks with Unified Semantic Matching framework. In this way, USM can jointly encode schema and input text, uniformly extract substructures in parallel, and controllably decode target structures on demand.
arXiv Detail & Related papers (2023-01-09T11:51:31Z)
Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables. We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph. Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)
Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations. We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.