ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in
Situation Recognition
- URL: http://arxiv.org/abs/2307.00586v3
- Date: Mon, 11 Sep 2023 09:43:35 GMT
- Title: ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in
Situation Recognition
- Authors: Debaditya Roy, Dhruv Verma, Basura Fernando
- Abstract summary: Situation Recognition is the task of generating a structured summary of what is happening in an image using an activity verb.
We leverage the CLIP foundational model that has learned the context of images via language descriptions.
Our cross-attention-based Transformer, known as ClipSitu XTF, outperforms the existing state-of-the-art by a large margin of 14.1% on semantic role labelling.
- Score: 20.000253437661
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Situation Recognition is the task of generating a structured summary of what
is happening in an image using an activity verb and the semantic roles played
by actors and objects. In this task, the same activity verb can describe a
diverse set of situations, just as the same actor or object category can play
a diverse set of semantic roles depending on the situation depicted in the
image. Hence, a situation recognition model needs to understand the context of
the image and the visual-linguistic meaning of semantic roles. Therefore, we
leverage the CLIP foundational model that has learned the context of images via
language descriptions. We show that deeper-and-wider multi-layer perceptron
(MLP) blocks obtain noteworthy results for the situation recognition task by
using CLIP image and text embedding features, and even outperform the
state-of-the-art CoFormer, a Transformer-based model, thanks to the external
implicit visual-linguistic knowledge encapsulated by CLIP and the expressive
power of modern MLP block designs. Motivated by this, we design a
cross-attention-based Transformer using CLIP visual tokens that models the
relation between textual roles and visual entities. Our cross-attention-based
Transformer, known as ClipSitu XTF, outperforms the existing state-of-the-art by a
large margin of 14.1% on semantic role labelling (value) for top-1 accuracy
on the imSitu dataset. Similarly, our ClipSitu XTF obtains state-of-the-art
situation localization performance. We will make the code publicly available.
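
For illustration, the cross-attention design described in the abstract can be approximated with a short PyTorch sketch. The snippet below is a minimal, hedged rendering of that idea rather than the authors' released implementation: CLIP visual patch tokens act as keys and values, a verb-and-role text embedding forms the query, and an MLP head predicts the noun filling the role. The class name `ClipSituXTFSketch`, the way the verb and role embeddings are combined, and all dimensions and layer sizes are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) of a cross-attention block
# in the spirit of ClipSitu XTF: CLIP visual patch tokens are keys/values and
# a verb+role text embedding is the query that selects relevant image regions.
import torch
import torch.nn as nn

class ClipSituXTFSketch(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, num_nouns=10000):
        super().__init__()
        self.xattn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # Wider MLP block, loosely following the paper's observation that
        # expressive MLP designs on top of CLIP features work well.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.noun_head = nn.Linear(embed_dim, num_nouns)

    def forward(self, patch_tokens, verb_emb, role_emb):
        # patch_tokens: (B, N, D) CLIP visual tokens for one image
        # verb_emb, role_emb: (B, D) CLIP text embeddings of verb and role
        query = (verb_emb + role_emb).unsqueeze(1)            # (B, 1, D)
        attended, _ = self.xattn(query, patch_tokens, patch_tokens)
        fused = self.norm(attended.squeeze(1) + verb_emb + role_emb)
        return self.noun_head(fused + self.mlp(fused))        # (B, num_nouns)

# Toy usage with random tensors standing in for CLIP features.
model = ClipSituXTFSketch()
logits = model(torch.randn(2, 50, 512), torch.randn(2, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 10000])
```

Under the same assumptions, the paper's deeper-and-wider MLP variant could be approximated by dropping the cross-attention and feeding the CLIP image, verb, and role embeddings directly into a stack of such MLP blocks.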
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos [18.308072018844122]
Situation recognition refers to the ability of an agent to identify and understand various situations or contexts based on available information and sensory inputs.
We propose ClipSitu, which harnesses CLIP-based image, verb, and role embeddings to predict nouns fulfilling all the roles associated with a verb.
We show that situational summaries empower our ClipSitu models to produce structured descriptions with reduced ambiguity compared to generic captions.
arXiv Detail & Related papers (2024-07-30T08:39:20Z)
- pOps: Photo-Inspired Diffusion Operators [55.93078592427929]
pOps is a framework that trains semantic operators directly on CLIP image embeddings.
We show that pOps can be used to learn a variety of photo-inspired operators with distinct semantic meanings.
arXiv Detail & Related papers (2024-06-03T13:09:32Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
arXiv Detail & Related papers (2023-10-09T17:59:04Z)
- Prompting Visual-Language Models for Dynamic Facial Expression Recognition [14.783257517376041]
This paper presents a novel visual-language model called DFER-CLIP.
It is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition.
It achieves state-of-the-art results compared with the current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.
arXiv Detail & Related papers (2023-08-25T13:52:05Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up Cross-modal Semantic Composition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment (a generic sketch of this idea follows this entry).
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
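
To make the text-to-pixel alignment mentioned in the CRIS entry concrete, here is a generic, hedged sketch of a per-pixel alignment loss. It is not the CRIS implementation; the function name `text_to_pixel_loss`, the temperature value, and the use of binary cross-entropy over cosine similarities are assumptions chosen for illustration.

```python
# Generic sketch of text-to-pixel alignment with a contrastive-style loss
# (illustrative only, not the CRIS implementation). Each pixel embedding is
# scored against the sentence embedding; pixels of the referred object are
# pulled toward the text and all other pixels are pushed away.
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_feat, gt_mask, temperature=0.07):
    # pixel_feats: (B, D, H, W) per-pixel embeddings from the vision branch
    # text_feat:   (B, D)       sentence embedding from the language branch
    # gt_mask:     (B, H, W)    binary mask of the referred object
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # Cosine similarity between every pixel and the sentence.
    logits = torch.einsum('bdhw,bd->bhw', pixel_feats, text_feat) / temperature
    # Binary cross-entropy: referred pixels should score high, others low.
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())

# Toy usage with random tensors standing in for model outputs.
loss = text_to_pixel_loss(torch.randn(2, 512, 26, 26),
                          torch.randn(2, 512),
                          torch.randint(0, 2, (2, 26, 26)))
print(loss.item())
```

The intuition is that pixels inside the referred region should score high against the sentence embedding while all other pixels should score low, which drives the pixel features toward the text feature.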