SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
- URL: http://arxiv.org/abs/2602.23353v1
- Date: Thu, 26 Feb 2026 18:55:06 GMT
- Title: SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
- Authors: Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata
- Abstract summary: The Platonic Representation Hypothesis posits that neural networks converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers. We ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data.
- Score: 43.640561199880274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.
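The two-stage recipe is concrete enough to sketch. Below is a minimal, illustrative PyTorch version: stage one fits the linear teacher as a closed-form ridge regression on the few paired embeddings; stage two couples unpaired batches with entropic optimal transport and penalizes mismatched pairwise-distance structure, a Gromov-Wasserstein-style divergence. The ridge solver, Sinkhorn coupling, and loss form are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def fit_linear_teacher(img_emb, txt_emb, lam=1e-2):
    """Stage 1: closed-form ridge regression mapping the few paired image
    embeddings onto their text embeddings (coarse shared geometry)."""
    d = img_emb.shape[1]
    A = img_emb.T @ img_emb + lam * torch.eye(d)
    return torch.linalg.solve(A, img_emb.T @ txt_emb)  # (d_img, d_txt)

def sinkhorn_plan(cost, eps=0.1, iters=50):
    """Entropic OT plan softly matching unpaired image/text batches."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    u, v = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]

def relational_ot_loss(z_img, z_txt):
    """Stage 2 (sketch): transfer relational structure across unpaired
    batches. With plan P and intra-modal distance matrices D_i, D_t,
    this evaluates the Gromov-Wasserstein objective
    sum_{i,j,k,l} (D_i[i,k] - D_t[j,l])^2 P[i,j] P[k,l]."""
    with torch.no_grad():
        plan = sinkhorn_plan(torch.cdist(z_img, z_txt))
    D_i, D_t = torch.cdist(z_img, z_img), torch.cdist(z_txt, z_txt)
    diff = D_i.unsqueeze(1).unsqueeze(3) - D_t.unsqueeze(0).unsqueeze(2)
    # Quartic in batch size; fine for small batches, factorize otherwise.
    return (diff ** 2
            * plan.unsqueeze(2).unsqueeze(3)
            * plan.unsqueeze(0).unsqueeze(1)).sum()
```

In practice the image embeddings would first be mapped through the teacher (or a small trainable aligner initialized from it) before computing the Sinkhorn cost, and the quartic-memory tensor above would be replaced by a factorized Gromov-Wasserstein solver such as the one in the POT library.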
Related papers
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions. We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals. We then introduce ReAlign, a training-free modality alignment strategy.
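The abstract does not spell out ReAlign's procedure, but a minimal sketch of a training-free correction under the stated decomposition might look as follows, assuming the stable bias is the offset between modality centroids and the anisotropic residual is a per-direction scale mismatch (both assumptions are ours):

```python
import torch

def realign_sketch(img_emb, txt_emb):
    """Illustrative training-free gap correction (our assumptions, not
    necessarily ReAlign's): remove the stable centroid bias, then rescale
    per dimension to absorb the anisotropic residual."""
    stable_bias = img_emb.mean(0) - txt_emb.mean(0)   # stable bias estimate
    img_c = img_emb - img_emb.mean(0)
    txt_c = txt_emb - txt_emb.mean(0)
    scale = txt_c.std(0) / (img_c.std(0) + 1e-8)      # anisotropic rescale
    img_aligned = img_c * scale + txt_emb.mean(0)     # move into text frame
    return img_aligned, stable_bias
```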
arXiv Detail & Related papers (2026-02-02T13:59:39Z)
- Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion [31.189038928192648]
Co2S is a semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models. An explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries. Experiments on six popular datasets demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-12-28T18:24:19Z)
- Semi-Supervised Contrastive Learning with Orthonormal Prototypes [1.478364697333309]
Dimensional collapse, where embeddings converge into a lower-dimensional space, poses a significant challenge. We propose CLOP, a novel semi-supervised loss function designed to prevent dimensional collapse by promoting the formation of linear subspaces among class embeddings. We show that CLOP improves performance in image classification and object detection tasks while also exhibiting greater stability across different learning rates and batch sizes.
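The abstract does not give CLOP's exact objective, but the orthonormal-prototype idea can be sketched: fix one orthonormal direction per class and pull embeddings toward their class direction, which keeps class subspaces mutually orthogonal and resists dimensional collapse. The unlabeled term below (hard pseudo-labels) and the temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def make_orthonormal_prototypes(num_classes, dim, seed=0):
    """One fixed orthonormal direction per class (requires dim >= classes),
    taken from the Q factor of a random Gaussian matrix."""
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, num_classes, generator=g))
    return q.T                                        # (num_classes, dim)

def clop_style_loss(z_lab, y, z_unlab, prototypes, tau=0.1, w_unsup=0.5):
    """Sketch of an orthonormal-prototype loss. Labeled embeddings are
    pulled toward their class prototype; unlabeled embeddings follow hard
    pseudo-labels (an assumed, common semi-supervised choice)."""
    logits_l = F.normalize(z_lab, dim=1) @ prototypes.T / tau
    sup = F.cross_entropy(logits_l, y)
    logits_u = F.normalize(z_unlab, dim=1) @ prototypes.T / tau
    pseudo = logits_u.detach().argmax(dim=1)
    return sup + w_unsup * F.cross_entropy(logits_u, pseudo)
```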
arXiv Detail & Related papers (2025-11-27T13:26:59Z)
- TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces [10.86297454943578]
We present a semi-supervised fine-tuning framework to address the challenge of training with a limited amount of labelled data. Experiments on several datasets demonstrate significant improvements in classification tasks under extremely low-labelled conditions.
arXiv Detail & Related papers (2025-03-10T20:56:54Z)
- Bridging Critical Gaps in Convergent Learning: How Representational Alignment Evolves Across Layers, Training, and Distribution Shifts [1.9458156037869137]
Convergent learning is the degree to which neural systems arrive at similar internal representations. We present a large-scale audit of convergent learning spanning dozens of vision models and thousands of layer-pair comparisons. Our findings fill critical gaps in the understanding of representational convergence, with implications for neuroscience and AI.
arXiv Detail & Related papers (2025-02-26T00:04:24Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset that yields performance gains.
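The paper's schema is not given here, but one plausible unification maps each feedback type (ratings, rankings, binary labels, all assumed formats) into preference pairs usable by both SFT and RLHF-style training, followed by a simple quality-based subset selection:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def unify_feedback(record):
    """Map one heterogeneous feedback record into a preference pair.
    The three input formats are illustrative, not the paper's schema."""
    kind, responses = record["type"], record["responses"]
    if kind == "rating":                 # scalar score per response
        ranked = sorted(responses, key=lambda r: r["score"])
        return PreferencePair(record["prompt"],
                              ranked[-1]["text"], ranked[0]["text"])
    if kind == "ranking":                # responses ordered best to worst
        return PreferencePair(record["prompt"],
                              responses[0]["text"], responses[-1]["text"])
    if kind == "binary":                 # thumbs up / thumbs down
        good = [r for r in responses if r["label"] == 1]
        bad = [r for r in responses if r["label"] == 0]
        if good and bad:
            return PreferencePair(record["prompt"],
                                  good[0]["text"], bad[0]["text"])
    return None                          # unusable record

def select_subset(pairs, quality_fn, keep_frac=0.5):
    """Stand-in for the high-quality/diverse extraction step: keep the
    top-scoring fraction under a user-supplied quality function."""
    ranked = sorted(pairs, key=quality_fn, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]
```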
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis on the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Adversarial Lagrangian Integrated Contrastive Embedding for Limited Size Datasets [8.926248371832852]
This study presents a novel adversarial Lagrangian integrated contrastive embedding (ALICE) method for small-sized datasets.
The accuracy improvement and training convergence of the proposed pre-trained adversarial transfer are demonstrated.
A novel adversarial integrated contrastive model using various augmentation techniques is also investigated.
arXiv Detail & Related papers (2022-10-06T23:59:28Z)
- Mixed Graph Contrastive Network for Semi-Supervised Node Classification [63.924129159538076]
We propose a novel graph contrastive learning method, termed Mixed Graph Contrastive Network (MGCN). In our method, we improve the discriminative capability of the latent embeddings by an unperturbed augmentation strategy and a correlation reduction mechanism. By combining the two settings, we extract rich supervision information from both the abundant unlabeled nodes and the rare yet valuable labeled nodes for discriminative representation learning.
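The abstract does not define the correlation reduction mechanism; a common instantiation, assumed in the sketch below, pushes the cross-view correlation matrix toward the identity (as in Barlow Twins) so that embedding dimensions decorrelate:

```python
import torch

def correlation_reduction_loss(z1, z2, off_diag_weight=5e-3):
    """Assumed instantiation of correlation reduction (Barlow-Twins-style;
    MGCN's exact mechanism may differ): drive the cross-view correlation
    matrix toward the identity to decorrelate embedding dimensions."""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)   # standardize per dim
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    c = z1.T @ z2 / n                             # (d, d) correlation
    on_diag = ((torch.diagonal(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + off_diag_weight * off_diag
```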
arXiv Detail & Related papers (2022-06-06T14:26:34Z)
- Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of semi-supervised learning (SSL) and domain adaptation (DA).
arXiv Detail & Related papers (2021-12-12T06:11:16Z)