Related papers: Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder

Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder

URL: http://arxiv.org/abs/2411.05195v2
Date: Thu, 20 Feb 2025 07:22:53 GMT
Title: Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
Authors: Siting Li, Pang Wei Koh, Simon Shaolei Du,
Abstract summary: We show that Generative Multimodal Large Language Models (MLLMs) achieve significantly higher accuracy than CLIP.<n>Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.
Score: 18.91969873367244
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent research has shown that CLIP models struggle with visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. One natural hypothesis is that the CLIP vision encoder does not embed essential information for these tasks. However, we find that this is not always the case: The encoder gathers query-relevant visual information, while CLIP fails to extract it. In particular, we show that another branch of Vision-Language Models (VLMs), Generative Multimodal Large Language Models (MLLMs), achieve significantly higher accuracy than CLIP in many of these tasks using the same vision encoder and weights, indicating that these Generative MLLMs perceive more -- as they extract and utilize visual information more effectively. We conduct a series of controlled experiments and reveal that their success is attributed to multiple key design choices, including patch tokens, position embeddings, and prompt-based weighting. On the other hand, enhancing the training data alone or applying a stronger text encoder does not suffice to solve the task, and additional text tokens offer little benefit. Interestingly, we find that fine-grained visual reasoning is not exclusive to generative models trained by an autoregressive loss: When converted into CLIP-like encoders by contrastive finetuning, these MLLMs still outperform CLIP under the same cosine similarity-based evaluation protocol. Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.

Related papers

CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions [17.05291662808873]
We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations.<n> Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs.<n> Secondly, CLIP-IN incorporates long captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP.
arXiv Detail & Related papers (2025-08-04T11:57:10Z)
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models [53.13731845500678]
We introduce a novel metric, $Rank_e$, to quantify the effect of vision encoder's prior knowledge on MLLM performance. We propose VisPRE, a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs.
arXiv Detail & Related papers (2025-03-23T11:33:09Z)
Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study [3.1667055223489786]
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments.
arXiv Detail & Related papers (2025-02-27T07:03:10Z)
Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models [18.87130615326443]
Vision-language models (VLMs) serve as foundation models for image captioning and text-to-image generation. Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding.
arXiv Detail & Related papers (2024-12-11T05:37:04Z)
Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute. We also illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.41055673919895]
This study explores the design space for MLLMs using a mixture of vision encoders and resolutions. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems. These models have been shown to be highly capable, but also lacking some basic visual understanding skills. This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities. CLIP has severe visual shortcomings, such as which can hardly distinguish orientation, quantity, color, structure. We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
Modeling Caption Diversity in Contrastive Vision-Language Pretraining [48.7603274197994]
We introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders.
arXiv Detail & Related papers (2024-04-30T01:19:18Z)
MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders. This technique introduces a fusion network to unify the processing of outputs from different visual experts. In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
Do Vision and Language Encoders Represent the World Similarly? [22.70701869402434]
Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. We find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training.
arXiv Detail & Related papers (2024-01-10T15:51:39Z)
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models [36.41816380074965]
We investigate the effectiveness of different vision encoders within Large Language Models (MLLMs) Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding. We propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging.
arXiv Detail & Related papers (2023-10-13T02:41:55Z)
VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection [58.47940430618352]
We propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD.
arXiv Detail & Related papers (2023-08-22T14:58:36Z)
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense presentations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. It significantly outperforms a CLIP model with +$4.9%$ and +$4.3%$ absolute Recall@1 improvement.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
Masked Contrastive Representation Learning [6.737710830712818]
This work presents Masked Contrastive Representation Learning (MACRL) for self-supervised visual pre-training. We adopt an asymmetric setting for the siamese network (i.e., encoder-decoder structure in both branches), where one branch with higher mask ratio and stronger data augmentation, while the other adopts weaker data corruptions. In our experiments, MACRL presents superior results on various vision benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and two other ImageNet subsets.
arXiv Detail & Related papers (2022-11-11T05:32:28Z)
Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP) We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
CyCLIP: Cyclic Contrastive Language-Image Pretraining [34.588147979731374]
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness. We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions. We propose CyCLIP, a framework for contrastive representation learning that explicitly optimize for the learned representations to be geometrically consistent in the image and text space.
arXiv Detail & Related papers (2022-05-28T15:31:17Z)
Introspective Deep Metric Learning for Image Retrieval [80.29866561553483]
We argue that a good similarity model should consider the semantic discrepancies with caution to better deal with ambiguous images for more robust training. We propose to represent an image using not only a semantic embedding but also an accompanying uncertainty embedding, which describes the semantic characteristics and ambiguity of an image, respectively. The proposed IDML framework improves the performance of deep metric learning through uncertainty modeling and attains state-of-the-art results on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets.
arXiv Detail & Related papers (2022-05-09T17:51:44Z)
How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
We show that CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.