ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
- URL: http://arxiv.org/abs/2406.17876v1
- Date: Tue, 25 Jun 2024 18:35:13 GMT
- Title: ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
- Authors: Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello
- Abstract summary: We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task.
In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective.
- Score: 0.2714641498775158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
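The abstract's key point is that CLIP is kept as an extra module rather than swapped in for the agent's visual encoder, with an auxiliary object-detection objective trained alongside the main task loss. Below is a minimal PyTorch sketch of that idea; the class names, dimensions, the multi-label "which objects are visible" formulation of the auxiliary head, and the loss weight are all illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch only: an agent backbone plus an auxiliary CLIP-based
# object-detection head. Names, dimensions, and the multi-label formulation
# are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentWithAuxClip(nn.Module):
    def __init__(self, backbone, hidden_dim, num_actions, num_object_classes, clip_dim=512):
        super().__init__()
        self.backbone = backbone                      # e.g. an Episodic-Transformer-style encoder
        self.action_head = nn.Linear(hidden_dim, num_actions)
        # Auxiliary head: predict which object classes appear in the current frame
        # from frozen CLIP image embeddings (CLIP augments, not replaces, the encoder).
        self.aux_object_head = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_object_classes),
        )

    def forward(self, obs, clip_feats):
        state = self.backbone(obs)                        # (B, hidden_dim) agent state
        action_logits = self.action_head(state)           # main task: next-action prediction
        object_logits = self.aux_object_head(clip_feats)  # auxiliary object-detection signal
        return action_logits, object_logits

def joint_loss(action_logits, actions, object_logits, object_presence, aux_weight=0.1):
    """Main action loss plus a weighted multi-label auxiliary object loss."""
    main = F.cross_entropy(action_logits, actions)
    aux = F.binary_cross_entropy_with_logits(object_logits, object_presence)
    return main + aux_weight * aux

# Toy usage with stand-in tensors (a real setup would feed frozen CLIP image features).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
model = AgentWithAuxClip(backbone, hidden_dim=256, num_actions=12, num_object_classes=80)
obs, clip_feats = torch.randn(4, 3, 224, 224), torch.randn(4, 512)
action_logits, object_logits = model(obs, clip_feats)
loss = joint_loss(action_logits, torch.randint(0, 12, (4,)),
                  object_logits, torch.randint(0, 2, (4, 80)).float())
```

Treating the auxiliary objective as multi-label object presence keeps the extra branch lightweight; the paper's actual detection objective and weighting may differ.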
Related papers
- Quantifying and Enabling the Interpretability of CLIP-like Models [19.459369149558405]
We conduct this study on six different CLIP models from OpenAI and OpenCLIP.
Our approach begins with using the TEXTSPAN algorithm and in-context learning to break down individual attention heads into specific properties.
Our findings reveal that larger CLIP models are generally more interpretable than their smaller counterparts.
arXiv Detail & Related papers (2024-09-10T15:19:40Z)
- C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection [98.34703790782254]
We introduce Category Common Prompt CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder.
Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing.
arXiv Detail & Related papers (2024-08-19T02:14:25Z)
- Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
arXiv Detail & Related papers (2023-10-26T08:12:53Z)
- Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z)
- DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
arXiv Detail & Related papers (2023-05-30T15:13:17Z)
- Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning [77.7070536959126]
In-context learning (ICL) emerges as a promising capability of large language models (LLMs).
In this paper, we investigate the working mechanism of ICL through an information flow lens.
We introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL.
arXiv Detail & Related papers (2023-05-23T15:26:20Z)
- HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models [30.279621764192843]
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions.
Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors.
We propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization.
arXiv Detail & Related papers (2023-03-28T07:54:54Z)
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
arXiv Detail & Related papers (2023-03-06T09:17:47Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching and a transformer encoder-decoder architecture (see the sketch after this list).
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
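For context on the DETR entry above, here is a minimal sketch of a set-based loss with bipartite (Hungarian) matching, assuming a single image and simplified cost terms (negative class probability plus L1 box distance). It omits the extra box-cost terms and per-term weighting used in practice and is not the reference implementation.

```python
# Simplified set-prediction loss in the spirit of DETR (single image, illustrative
# cost terms only; not the reference implementation).
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def set_prediction_loss(pred_logits, pred_boxes, tgt_classes, tgt_boxes):
    """pred_logits: (Q, C+1) with a trailing 'no object' class; pred_boxes: (Q, 4);
    tgt_classes: (T,) integer labels; tgt_boxes: (T, 4)."""
    probs = pred_logits.softmax(-1)
    # Pairwise assignment cost: negative class probability plus L1 box distance.
    cost = -probs[:, tgt_classes] + torch.cdist(pred_boxes, tgt_boxes, p=1)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())  # Hungarian matching
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    # Matched queries take their target's class; all other queries take 'no object'.
    no_object = pred_logits.shape[-1] - 1
    target_classes = torch.full((pred_logits.shape[0],), no_object, dtype=torch.long)
    target_classes[rows] = tgt_classes[cols]
    cls_loss = F.cross_entropy(pred_logits, target_classes)
    box_loss = F.l1_loss(pred_boxes[rows], tgt_boxes[cols])
    return cls_loss + box_loss

# Toy usage: 10 queries, 4 object classes (+1 'no object'), 2 ground-truth boxes.
loss = set_prediction_loss(torch.randn(10, 5), torch.rand(10, 4),
                           torch.tensor([1, 3]), torch.rand(2, 4))
```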
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.