C-CLIP: Contrastive Image-Text Encoders to Close the
Descriptive-Commentative Gap
- URL: http://arxiv.org/abs/2309.03921v1
- Date: Wed, 6 Sep 2023 19:03:49 GMT
- Title: C-CLIP: Contrastive Image-Text Encoders to Close the
Descriptive-Commentative Gap
- Authors: William Theisen and Walter Scheirer
- Abstract summary: The interplay between the image and comment on a social media post is one of high importance for understanding its overall message.
Recent strides in multimodal embedding models, namely CLIP, have provided an avenue forward in relating image and text.
The current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language.
We show that training contrastive image-text encoders on explicitly commentative pairs results in large improvements in retrieval results.
- Score: 0.5439020425819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The interplay between the image and comment on a social media post is one of
high importance for understanding its overall message. Recent strides in
multimodal embedding models, namely CLIP, have provided an avenue forward in
relating image and text. However, the current training regime for CLIP models is
insufficient for matching content found on social media, regardless of site or
language. Current CLIP training data is based on what we call "descriptive"
text: text in which an image is merely described. This is something rarely seen
on social media, where the vast majority of text content is "commentative" in
nature. The captions provide commentary and broader context related to the
image, rather than describing what is in it. Current CLIP models perform poorly
on retrieval tasks where image-caption pairs display a commentative
relationship. Closing this gap would be beneficial for several important
application areas related to social media. For instance, it would allow groups
focused on Open-Source Intelligence Operations (OSINT) to further aid efforts
during disaster events, such as the ongoing Russian invasion of Ukraine, by
easily exposing data to non-technical users for discovery and analysis. In
order to close this gap we demonstrate that training contrastive image-text
encoders on explicitly commentative pairs results in large improvements in
retrieval results, with the results extending across a variety of non-English
languages.
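The abstract's core idea, training a contrastive image-text encoder on matched pairs and scoring it by retrieval, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D embeddings, the function names, and the temperature of 0.07 are all assumptions chosen for illustration, and the loss shown is the standard symmetric cross-entropy over cosine similarities used by CLIP-style models.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(image_embs, text_embs):
    """sims[i][j] = cosine similarity between image i and text j."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    The temperature value is an assumed constant for this sketch.
    """
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

def recall_at_k(image_embs, text_embs, k=1):
    """Fraction of images whose paired text (same index) ranks in the top k."""
    sims = similarity_matrix(image_embs, text_embs)
    hits = 0
    for i, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sims)

# Hypothetical toy embeddings: row i of each list forms one image-comment pair.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.1]]
# The same texts rotated by one position, so no pair lines up.
texts_misaligned = [texts_aligned[1], texts_aligned[2], texts_aligned[0]]

print(recall_at_k(images, texts_aligned, k=1))
print(recall_at_k(images, texts_misaligned, k=1))
print(symmetric_contrastive_loss(images, texts_aligned))
print(symmetric_contrastive_loss(images, texts_misaligned))
```

In this toy setup the aligned pairs give a lower loss and a higher recall than the misaligned ones, which is the behaviour that training on commentative pairs is meant to produce for real social media captions.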
Related papers
- EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning [38.30565103892611]
In this paper, we work towards the Entity-centric Image-Text Matching (EITM) problem.
The challenge of this task mainly lies in the larger semantic gap in entity association modeling.
We devise a multimodal attentive contrastive learning framework to address the EITM problem, developing a model named EntityCLIP.
arXiv Detail & Related papers (2024-10-23T12:12:56Z)
- Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode [0.27195102129095]
Photo search has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model.
This abstract summarizes the foundational principles of CLIP and highlights its potential impact on advancing the field of photo search.
arXiv Detail & Related papers (2024-01-24T17:35:38Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)