Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
- URL: http://arxiv.org/abs/2312.14400v1
- Date: Fri, 22 Dec 2023 03:01:41 GMT
- Title: Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
- Authors: Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Ehsan Abbasnejad, Hamed Damirchi, Ignacio M. Jara, Felipe Bravo-Marquez and Anton van den Hengel
- Abstract summary: Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
- Score: 49.631908848868505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various neural architectures, spanning Transformer-based models like Vision Transformers (ViTs) to Convolutional Networks (ConvNets) like ResNets, are trained with CLIP and serve as universal backbones across diverse vision tasks. Despite being trained on the same data with the same objective, whether these architectures learn equally effective representations remains a critical question. Our investigation explores the differences in CLIP performance among these backbone architectures, revealing significant disparities in their classifications. Notably, normalizing these representations results in substantial performance variations. Our findings showcase a remarkable possible synergy between backbone predictions that could reach an improvement of over 20% through informed selection of the appropriate backbone. Moreover, we propose a simple yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%. We will release the code for reproducing the results.
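The paper states that its code will be released; as a concrete illustration of the two observations above (feature normalization and combining predictions from multiple backbones), the following is a minimal sketch only. It assumes the open_clip library, the "a photo of a {class}" prompt template, specific checkpoint tags, and simple probability averaging as the combination rule; the paper's actual ensembling strategy may differ.

```python
# Sketch: zero-shot CLIP classification with L2-normalized features, combining
# predictions from multiple backbones by averaging class probabilities.
# Model names, prompts, and the averaging rule are illustrative assumptions,
# not the paper's released code.
import torch
import open_clip


def zero_shot_probs(model_name, pretrained, images, class_names, device="cpu"):
    """Return softmax class probabilities for one CLIP backbone."""
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(model_name)
    model = model.to(device).eval()

    with torch.no_grad():
        text = tokenizer([f"a photo of a {c}" for c in class_names]).to(device)
        pixels = torch.stack([preprocess(im) for im in images]).to(device)

        img_feat = model.encode_image(pixels)
        txt_feat = model.encode_text(text)
        # Normalization matters: cosine-similarity logits require unit-norm features.
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

        logits = 100.0 * img_feat @ txt_feat.T
        return logits.softmax(dim=-1)


def ensemble_predict(images, class_names, backbones, device="cpu"):
    """Average per-backbone probabilities and take the argmax (one simple way
    to combine predictions; the paper's exact combination may differ)."""
    probs = [zero_shot_probs(name, tag, images, class_names, device)
             for name, tag in backbones]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)


# Example: a ViT and a ResNet CLIP backbone (checkpoint tags are assumptions).
backbones = [("ViT-B-32", "openai"), ("RN50", "openai")]
```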
Related papers
- Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103] (arXiv 2024-05-27)
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
This paper explores the differences across various CLIP-trained vision backbones.
The method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone.
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836] (arXiv 2024-05-06)
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
- SkelVIT: Consensus of Vision Transformers for a Lightweight Skeleton-Based Action Recognition System [0.0] (arXiv 2023-11-14)
Skeleton-based action recognition has received attention from many researchers because it is robust to viewpoint and illumination changes.
With the emergence of deep learning models, it has become very popular to represent skeleton data in pseudo-image form and apply CNNs for action recognition.
Recently, attention networks, and more specifically transformers, have provided promising results on various vision problems.
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841] (arXiv 2022-10-17)
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
- Learning Target-aware Representation for Visual Tracking via Informative Interactions [49.552877881662475] (arXiv 2022-01-07)
We introduce a novel backbone architecture to improve the target-perception ability of feature representations for tracking.
The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNNs and Transformers.
- Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations [9.6221436745451] (arXiv 2021-08-12)
We describe how we generate a dataset with over a billion images via large-scale weakly-supervised pretraining.
We leverage Transformers to replace the traditional convolutional backbone.
We show that large-scale Transformer-based pretraining provides significant benefits to industry computer vision applications.
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478] (arXiv 2021-03-22)
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
- ResNeSt: Split-Attention Networks [86.25490825631763] (arXiv 2020-04-19)
We present a modularized architecture which applies channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in the accuracy-latency trade-off on image classification.
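The first related paper's title mentions adaptive backbone ensembling; its actual method is not described here, so the following is only a minimal sketch of one way such adaptivity could look: learning a convex combination of per-backbone class probabilities on a small held-out labelled set. All names and the objective are illustrative assumptions.

```python
# Sketch of learned-weight ensembling over per-backbone probabilities.
# This is an assumption based on the related paper's title, not its method.
import torch


def learn_ensemble_weights(per_backbone_probs, labels, steps=200, lr=0.1):
    """per_backbone_probs: tensor [num_backbones, num_samples, num_classes];
    labels: tensor [num_samples] of ground-truth class indices."""
    logits_w = torch.zeros(per_backbone_probs.shape[0], requires_grad=True)
    opt = torch.optim.Adam([logits_w], lr=lr)
    for _ in range(steps):
        w = logits_w.softmax(dim=0)                        # convex weights over backbones
        mixed = torch.einsum("b,bnc->nc", w, per_backbone_probs)
        loss = torch.nn.functional.nll_loss(mixed.clamp_min(1e-9).log(), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return logits_w.softmax(dim=0).detach()


# Example with random stand-in data (2 backbones, 8 samples, 5 classes).
probs = torch.rand(2, 8, 5).softmax(dim=-1)
labels = torch.randint(0, 5, (8,))
weights = learn_ensemble_weights(probs, labels)
```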