Data or Language Supervision: What Makes CLIP Better than DINO?
- URL: http://arxiv.org/abs/2510.11835v1
- Date: Mon, 13 Oct 2025 18:34:58 GMT
- Title: Data or Language Supervision: What Makes CLIP Better than DINO?
- Authors: Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, Serena Yeung-Levy
- Abstract summary: We show that CLIP captures high-level semantics, while DINO is more responsive to low-level features like colors and styles. When integrated into vision-language models, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
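To make the two flavors of language supervision mentioned in the abstract concrete, here is a minimal PyTorch sketch of a standard CLIP softmax-contrastive loss next to a SigLIP-style sigmoid loss; the temperature, bias value, and embedding shapes are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP objective: symmetric softmax cross-entropy over
    the image-text similarity matrix (matched pairs on the diagonal)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

def sigmoid_pairwise_loss(img_emb, txt_emb, temperature=0.07, bias=-10.0):
    """SigLIP-style variant: every image-text pair is an independent
    binary classification (+1 for matched pairs, -1 otherwise)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature + bias       # (B, B)
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).mean()

# toy usage with random embeddings
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item(), sigmoid_pairwise_loss(img, txt).item())
```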
Related papers
- Scaling Language-Free Visual Representation Learning [62.31591054289958]
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders.
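As background for how a vision encoder (CLIP or SSL) is integrated into a VLM and probed with VQA, the following is a minimal sketch of a LLaVA-style projector that maps frozen vision features into the LLM's token embedding space; the two-layer MLP, dimensions, and module names are illustrative assumptions, not the setup of either paper.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map frozen vision-encoder patch features into the LLM's token
    embedding space so they can be prepended to the text tokens."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):          # (B, num_patches, vision_dim)
        return self.proj(patch_features)         # (B, num_patches, llm_dim)

# toy usage: 576 patch tokens from a frozen encoder become 576 "visual tokens"
features = torch.randn(2, 576, 1024)
visual_tokens = VisionToLLMProjector()(features)
print(visual_tokens.shape)  # torch.Size([2, 576, 4096])
```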
arXiv Detail & Related papers (2025-04-01T17:59:15Z)
- DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment [20.953645420787527]
We train a CLIP-like model with only a fraction of the computational cost compared to CLIP. We achieve state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
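Zero-shot classification with any CLIP-like (image/text-aligned) model amounts to nearest-prompt retrieval in the shared embedding space; a minimal sketch follows, where `encode_text` is a hypothetical stand-in for the model's text tower rather than an API from this paper.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_names, encode_text, temperature=0.01):
    """Score an image against a text prompt per class and return the most
    similar class. `encode_text` stands in for the model's text encoder."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)    # (C, D)
    image_emb = F.normalize(image_emb, dim=-1)               # (D,)
    probs = (image_emb @ text_emb.t() / temperature).softmax(dim=-1)
    return class_names[probs.argmax().item()], probs

# toy usage with a random "text tower" and image embedding
fake_encode_text = lambda prompts: torch.randn(len(prompts), 512)
label, probs = zero_shot_classify(torch.randn(512), ["cat", "dog", "car"], fake_encode_text)
print(label, probs)
```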
arXiv Detail & Related papers (2024-12-20T20:46:48Z)
- Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder [18.91969873367244]
We show that Generative Multimodal Large Language Models (MLLMs) achieve significantly higher accuracy than CLIP. Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.
arXiv Detail & Related papers (2024-11-07T21:39:51Z)
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
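A rough way to picture the encoder-free recipe is to replace the vision tower with a single patch-embedding layer whose outputs go straight to the decoder alongside the text tokens; this sketch is a simplified assumption about the general idea, not EVE's actual architecture.

```python
import torch
import torch.nn as nn

class RawPatchEmbedder(nn.Module):
    """Encoder-free front end: split the image into patches and linearly
    project each patch directly into the LLM's embedding space."""
    def __init__(self, patch_size=14, llm_dim=4096, channels=3):
        super().__init__()
        self.to_tokens = nn.Conv2d(channels, llm_dim,
                                   kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                    # (B, 3, H, W)
        tokens = self.to_tokens(images)           # (B, llm_dim, H/ps, W/ps)
        return tokens.flatten(2).transpose(1, 2)  # (B, num_patches, llm_dim)

# toy usage: a 336x336 image becomes 24x24 = 576 visual tokens with no pre-trained encoder
tokens = RawPatchEmbedder()(torch.randn(1, 3, 336, 336))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```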
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism built on top of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
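One way to picture a video-proxy mechanism: a few learnable proxy tokens attend over the patch tokens of all frames and are pooled into a single video embedding; this is a loose sketch under simplified assumptions (plain cross-attention, mean pooling), not CLIP-ViP's actual proxy-guided attention.

```python
import torch
import torch.nn as nn

class VideoProxyPooling(nn.Module):
    """Learnable proxy tokens attend over patch tokens from all frames,
    then are averaged into a single video-level embedding."""
    def __init__(self, dim=768, num_proxies=4, num_heads=8):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(1, num_proxies, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_patch_tokens):         # (B, T * P, dim)
        proxies = self.proxies.expand(frame_patch_tokens.size(0), -1, -1)
        out, _ = self.attn(proxies, frame_patch_tokens, frame_patch_tokens)
        return out.mean(dim=1)                     # (B, dim) video embedding

# toy usage: 8 frames x 196 patch tokens from a frozen CLIP ViT
video_emb = VideoProxyPooling()(torch.randn(2, 8 * 196, 768))
print(video_emb.shape)  # torch.Size([2, 768])
```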
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision [26.13829720290035]
Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision.
We propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants.
arXiv Detail & Related papers (2022-03-11T08:41:00Z)
- SLIP: Self-supervision meets Language-Image Pre-training [79.53764315471543]
We study whether self-supervised learning can aid in the use of language supervision for visual representation learning.
We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training.
We find that SLIP enjoys the best of both worlds: better performance than either self-supervision or language supervision alone.
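SLIP's multi-task recipe can be summarized as adding a SimCLR-style self-supervised term to the CLIP objective on the same batch; below is a minimal sketch with an illustrative weight and a simplified InfoNCE stand-in for both terms (real SimCLR also contrasts against all other views in the batch).

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE between two batches of matched embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def slip_style_loss(img_emb, txt_emb, view1_emb, view2_emb, ssl_weight=1.0):
    """SLIP-style multi-task objective: CLIP image-text contrastive loss plus a
    SimCLR-style loss between two augmented views of the same images."""
    clip_term = info_nce(img_emb, txt_emb)
    ssl_term = info_nce(view1_emb, view2_emb)   # simplified SimCLR stand-in
    return clip_term + ssl_weight * ssl_term

# toy usage with random embeddings for one batch
B, D = 8, 256
loss = slip_style_loss(torch.randn(B, D), torch.randn(B, D),
                       torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```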
arXiv Detail & Related papers (2021-12-23T18:07:13Z)
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
We show that CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has strong zero-shot capability on various vision tasks.
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.