FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
- URL: http://arxiv.org/abs/2401.07669v1
- Date: Mon, 15 Jan 2024 13:27:34 GMT
- Title: FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
- Authors: Darshan Singh S and Zeeshan Khan and Makarand Tapaswi
- Abstract summary: We show that it is possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties.
We adapt CLIP efficiently on a high-quality, comprehensive, and relatively small dataset.
We learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented.
- Score: 19.08882495584709
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While contrastive language-image pretraining (CLIP) has exhibited impressive
performance by learning highly semantic and generalized representations, recent
works have exposed a fundamental drawback in its syntactic properties: interpreting
fine-grained attributes, actions, spatial relations, states, and other details that
require compositional reasoning. One reason for this is that natural captions often
do not capture all the visual details of a scene, so unaddressed visual concepts are
misattributed to the wrong words, and the pooled image and text features end up
acting as a bag of words, losing syntactic information. In this work, we ask: Is it
possible to enhance CLIP's fine-grained and syntactic abilities without
compromising its semantic properties? We show that this is possible by adapting
CLIP efficiently on a high-quality, comprehensive, and relatively small
dataset. We demonstrate our adaptation strategy on VidSitu, a video situation
recognition dataset annotated with verbs and rich semantic role labels (SRL).
We use the SRL and verb information to create rule-based detailed captions,
making sure they capture most of the visual concepts. Combined with hard
negatives and hierarchical losses, these annotations allow us to learn a
powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that
preserves semantic understanding while being detail-oriented. We evaluate on
five diverse vision-language tasks in both fine-tuning and zero-shot settings,
achieving consistent improvements over the base CLIP model.
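To make the adaptation recipe concrete, below is a minimal sketch (not the authors' released code) of contrastively fine-tuning CLIP on detailed captions paired with rule-based hard negatives, assuming the Hugging Face transformers CLIPModel/CLIPProcessor API; the hierarchical losses mentioned in the abstract are omitted and all names are illustrative.
```python
# Illustrative sketch only: contrast each frame with its detailed caption and a
# rule-based hard negative (e.g. the same caption with the verb or a semantic
# role swapped). Not the official FiGCLIP implementation.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def contrastive_loss_with_hard_negatives(images, captions, hard_negatives, temperature=0.07):
    """images: list of PIL frames; captions / hard_negatives: parallel lists of strings."""
    inputs = processor(text=captions + hard_negatives, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"]), dim=-1)
    logits = img @ txt.T / temperature              # (B, 2B): B positives, B hard negatives
    targets = torch.arange(len(images))             # i-th image matches i-th (positive) caption
    return F.cross_entropy(logits, targets)
```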
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Exploring Visual Interpretability for Contrastive Language-Image Pre-training [23.569964756096986]
Contrastive Language-Image pre-training learns rich representations via readily available supervision from natural language.
The visual interpretability of CLIP, however, has not yet been studied.
We integrate the above methods into Interpretable Contrastive Language-Image pre-training (ICLIP).
arXiv Detail & Related papers (2022-09-15T05:01:03Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge amounts of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning [34.46948978082648]
ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is fine-tuned from scratch to cope with caption generation.
This paper investigates the potential deficiencies of INP for video captioning and explores the key to generating accurate descriptions.
We introduce Dual Attribute Prediction, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes.
arXiv Detail & Related papers (2021-11-30T06:37:44Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption, employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
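As a rough illustration of the prefix idea described above, the sketch below maps a CLIP image embedding to a short prefix of GPT-2 token embeddings with a small MLP, assuming the Hugging Face transformers CLIPModel and GPT2LMHeadModel APIs; PrefixMapper, prefix_len, and the dimensions are hypothetical choices, not taken from the ClipCap release.
```python
# Rough sketch (assumptions, not the ClipCap release): a small MLP maps the CLIP
# image embedding to a prefix of GPT-2 token embeddings; the language model then
# continues from that prefix to produce the caption.
import torch
import torch.nn as nn
from transformers import CLIPModel, GPT2LMHeadModel

class PrefixMapper(nn.Module):
    """Hypothetical mapping network: CLIP embedding -> prefix_len GPT-2 embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.gpt_dim, self.prefix_len = gpt_dim, prefix_len
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embedding):                        # (B, clip_dim)
        prefix = self.mlp(clip_embedding)                     # (B, prefix_len * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")   # image features: 512-d
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")                     # hidden size: 768
mapper = PrefixMapper()

def caption_logits(pixel_values, caption_ids):
    """Score caption tokens conditioned on the mapped CLIP prefix."""
    with torch.no_grad():                                          # CLIP is kept frozen here
        img_emb = clip.get_image_features(pixel_values=pixel_values)   # (B, 512)
    prefix_emb = mapper(img_emb)                                        # (B, L, 768)
    token_emb = gpt2.transformer.wte(caption_ids)                       # (B, T, 768)
    inputs_embeds = torch.cat([prefix_emb, token_emb], dim=1)
    return gpt2(inputs_embeds=inputs_embeds).logits                     # (B, L+T, vocab)
```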