Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment
- URL: http://arxiv.org/abs/2602.00653v1
- Date: Sat, 31 Jan 2026 10:57:46 GMT
- Title: Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment
- Authors: Lukas Kuhn, Giuseppe Serra, Florian Buettner
- Abstract summary: We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
- Score: 12.336161969869567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
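As a concrete reading of the abstract, the sketch below implements the two-term objective it describes: predict the frozen text encoder's embedding from each augmented image view, and regularize the visual embeddings toward an isotropic Gaussian. The SIGReg term here is a simplified moment-matching surrogate over random 1-D projections (the paper's exact sketched test statistic is not reproduced), and `predictor`, `lam`, and the MSE prediction loss are illustrative assumptions.

```python
# Minimal sketch of a NOVA-style objective. Assumptions: an MSE prediction
# loss, and SIGReg approximated by pushing the first two moments of random
# 1-D projections of the embeddings toward N(0, 1).
import torch
import torch.nn.functional as F

def sigreg_surrogate(z: torch.Tensor, num_slices: int = 64) -> torch.Tensor:
    """Encourage random 1-D projections of z to look standard Gaussian."""
    d = z.shape[1]
    dirs = F.normalize(torch.randn(d, num_slices, device=z.device), dim=0)
    proj = z @ dirs                                # (batch, num_slices)
    mean, var = proj.mean(dim=0), proj.var(dim=0)
    return (mean ** 2).mean() + ((var - 1.0) ** 2).mean()

def nova_step(image_encoder, predictor, text_encoder, views, texts, lam=1.0):
    """One training step: predict frozen text embeddings from image views."""
    with torch.no_grad():                          # frozen text encoder
        t = text_encoder(texts)
    loss = 0.0
    for v in views:                                # augmented image views
        z = image_encoder(v)
        loss = loss + F.mse_loss(predictor(z), t) + lam * sigreg_surrogate(z)
    return loss / len(views)
```

With this shape, `lam` is the only trade-off hyperparameter, consistent with the abstract's single-hyperparameter claim.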
Related papers
- Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought [55.65577137924979]
We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
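A hedged sketch of the "continuous action" idea as summarized above: instead of emitting discrete vocabulary tokens for positions, a small head regresses normalized 2-D coordinates from the backbone's hidden state. `CoordHead` and its dimensions are hypothetical; the paper's actual interface is not reproduced.

```python
# Hypothetical sketch: a regression head that emits continuous, normalized
# (x, y) image coordinates from a language-model hidden state instead of
# discrete vocabulary tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),
            nn.Sigmoid(),                          # coordinates in [0, 1]^2
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)                        # (batch, 2)

head = CoordHead()
h = torch.randn(4, 768)                            # stand-in hidden states
loss = F.smooth_l1_loss(head(h), torch.rand(4, 2)) # regress to ground truth
```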
arXiv Detail & Related papers (2026-02-27T12:04:07Z) - Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
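The flow-matching piece of such pretraining can be illustrated with the standard linear-interpolant objective below; the `detach()` on the context is only a stand-in for the "context isolation" idea, and the predictor's architecture and conditioning are assumptions, not the paper's specification.

```python
# Illustrative flow-matching loss in the linear-interpolant form; x1 is a
# (batch, dim) target latent. The detach() on the context is a stand-in
# for "context isolation" (no decoder gradient into the representation).
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, x1: torch.Tensor, context: torch.Tensor):
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                    # point on the straight path
    target_v = x1 - x0                             # constant target velocity
    pred_v = velocity_net(x_t, t, context.detach())
    return F.mse_loss(pred_v, target_v)
```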
arXiv Detail & Related papers (2025-12-24T07:07:08Z) - One Last Attention for Your Vision-Language Model [42.872184600248914]
We propose Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix. Experiments show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably to the current state of the art in most settings.
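A minimal sketch of what such a learned calibration mask could look like, assuming the "rational matrix" is a (tokens × dim) fused representation: a lightweight attention layer produces per-element sigmoid weights that rescale it. All names and shapes here are hypothetical.

```python
# Hypothetical sketch: a lightweight attention layer at the end of a VLM
# produces per-element sigmoid weights that calibrate the final fused
# ("rational") matrix. Names and shapes are illustrative.
import torch
import torch.nn as nn

class RationalMask(nn.Module):
    def __init__(self, dim: int, heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_mask = nn.Linear(dim, dim)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, tokens, dim) final fused representation
        attended, _ = self.attn(fused, fused, fused)
        mask = torch.sigmoid(self.to_mask(attended))   # weights in (0, 1)
        return fused * mask                            # calibrated output

out = RationalMask(dim=512)(torch.randn(2, 16, 512))
```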
arXiv Detail & Related papers (2025-07-21T10:35:32Z) - Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training [78.60953331455565]
PRIOR is a vision-language pre-training approach that prioritizes image-related tokens through differential weighting in the NTP loss. We observe average relative improvements of 19% and 8% on several vision-language benchmarks compared to NTP.
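Differential weighting of the next-token-prediction loss is straightforward to write down; the sketch below upweights tokens flagged as image-related. How the flags and weight values are produced is abstracted into a binary mask, which is a simplification of the paper's weighting scheme.

```python
# Sketch of differentially weighted next-token prediction: per-token
# cross-entropy, with image-related tokens upweighted. The binary
# `image_related` mask is a simplification of how the weights are derived.
import torch
import torch.nn.functional as F

def weighted_ntp_loss(logits, targets, image_related, up_weight: float = 2.0):
    # logits: (batch, seq, vocab); targets, image_related: (batch, seq)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    weights = torch.where(
        image_related.bool(),
        torch.full_like(per_token, up_weight),     # image-related tokens
        torch.ones_like(per_token),                # everything else
    )
    return (weights * per_token).sum() / weights.sum()
```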
arXiv Detail & Related papers (2025-05-13T21:27:52Z) - Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning [5.242869847419834]
Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training model through adaptive prompt tuning.
arXiv Detail & Related papers (2024-12-19T08:51:01Z) - Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z) - Understanding Prompt Tuning for V-L Models Through the Lens of Neural Collapse [47.89674843370092]
We propose Neural-collapse-anchored Prompt Tuning (NPT), a novel method that learns prompts with text and image representations.
NPT incorporates two regularization terms, language-modality collapse and multi-modality isomorphism, and is compatible with other prompt tuning methods.
arXiv Detail & Related papers (2023-06-28T06:37:03Z) - Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - A Simplified Framework for Contrastive Learning for Node Representations [2.277447144331876]
We investigate the potential of deploying contrastive learning in combination with Graph Neural Networks for embedding nodes in a graph.
We show that the quality of the resulting embeddings and training time can be significantly improved by a simple column-wise postprocessing of the embedding matrix.
This modification yields improvements in downstream classification tasks of up to 1.5% and even beats existing state-of-the-art approaches on 6 out of 8 different benchmarks.
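The snippet below shows the generic form of such a column-wise postprocessing step, standardizing each embedding dimension to zero mean and unit variance; the paper's exact transform may differ from this stand-in.

```python
# Generic form of a column-wise postprocessing step: standardize each
# embedding dimension (column) to zero mean and unit variance. The paper's
# exact transform may differ.
import torch

def columnwise_standardize(X: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # X: (num_nodes, dim) node embedding matrix
    mu = X.mean(dim=0, keepdim=True)
    sigma = X.std(dim=0, keepdim=True)
    return (X - mu) / (sigma + eps)

Z = columnwise_standardize(torch.randn(1000, 128))
```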
arXiv Detail & Related papers (2023-05-01T02:04:36Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
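As a generic stand-in for cross-modality region contrastive learning, the sketch below computes a symmetric InfoNCE loss between index-matched region and text features; the paper's actual construction of region-level positives and negatives is not reproduced.

```python
# Generic stand-in for cross-modality region contrastive learning: a
# symmetric InfoNCE loss between index-matched region and text features.
import torch
import torch.nn.functional as F

def region_text_infonce(region_feats, text_feats, tau: float = 0.07):
    # region_feats, text_feats: (batch, dim), positives share an index
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = r @ t.T / tau                         # pairwise similarities
    labels = torch.arange(r.size(0), device=r.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```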
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.