ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
- URL: http://arxiv.org/abs/2406.17876v1
- Date: Tue, 25 Jun 2024 18:35:13 GMT
- Title: ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
- Authors: Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello,
- Abstract summary: We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task.
In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective.
- Score: 0.2714641498775158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
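The auxiliary-objective pattern described in the abstract lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration (not the paper's code): a frozen CLIP image encoder feeds a small multi-label object head whose loss is added to the agent's main task loss. Names such as `CLIPAuxObjectHead` and `auxiliary_object_loss` are illustrative, not taken from the paper or the Episodic Transformer codebase.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

class CLIPAuxObjectHead(nn.Module):
    """Multi-label object head on top of frozen CLIP image features (illustrative)."""
    def __init__(self, num_object_classes: int, clip_dim: int = 512):
        super().__init__()
        self.classifier = nn.Linear(clip_dim, num_object_classes)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        return self.classifier(clip_feats)  # (B, num_object_classes) logits

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP stays frozen; only the auxiliary head is trained
aux_head = CLIPAuxObjectHead(num_object_classes=80).to(device)

def auxiliary_object_loss(frames: torch.Tensor, object_targets: torch.Tensor) -> torch.Tensor:
    # frames: (B, 3, 224, 224), already preprocessed; object_targets: (B, 80) multi-hot floats
    with torch.no_grad():
        feats = clip_model.encode_image(frames).float()
    logits = aux_head(feats)
    return nn.functional.binary_cross_entropy_with_logits(logits, object_targets)

# Training step (sketch): total_loss = agent_task_loss + aux_weight * auxiliary_object_loss(...)
```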
Related papers
- Are We Done with Object-Centric Learning? [65.67948794110212]
Object-centric learning (OCL) seeks to learn representations that only encode an object, isolated from other objects or background cues in a scene.
With recent sample-efficient segmentation models, we can separate objects in the pixel space and encode them independently.
We address the OOD generalization challenge caused by spurious background cues through the lens of OCL.
arXiv Detail & Related papers (2025-04-09T17:59:05Z)
- DiffCLIP: Differential Attention Meets CLIP [57.396578974401734]
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures.
With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks.
arXiv Detail & Related papers (2025-03-09T14:04:09Z)
- CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation [3.1667055223489786]
Contrastive Language-Image Pre-training models excel in zero-shot classification, yet face challenges in complex multi-object scenarios.
This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO.
Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects.
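The first-mention bias is easy to probe with off-the-shelf CLIP. The snippet below is a hypothetical mini-probe (not the ComCO protocol): it scores the same image against two captions that mention the objects in opposite order; a large score gap suggests order sensitivity in the text encoder. The file name `cat_and_dog.jpg` is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat_and_dog.jpg")        # placeholder image containing both objects
captions = ["a photo of a cat and a dog",    # same content,
            "a photo of a dog and a cat"]    # different mention order

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, 2) image-text similarity scores

print(logits.softmax(dim=-1))  # a lopsided distribution hints at first-mention bias
```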
arXiv Detail & Related papers (2025-02-27T07:34:42Z)
- Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study [3.1667055223489786]
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks.
This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments.
arXiv Detail & Related papers (2025-02-27T07:03:10Z)
- Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition [1.2499537119440243]
We tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) to classify objects based solely on descriptive attributes, excluding object class names.
We release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning.
We introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes.
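One generic way to bring higher resolutions to bear (a simplification, not the paper's modified architecture) is to score descriptive attributes on the full image plus zoomed-in crops, so small parts are seen at CLIP's native input size. The attribute strings and the file name below are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

attributes = ["a bird with a long curved beak", "a bird with red wing patches"]
image = Image.open("bird.jpg")  # placeholder fine-grained image

def multi_resolution_views(img: Image.Image):
    """Full image plus a 2x2 grid of crops; each crop is re-encoded at 224x224,
    effectively viewing its parts at twice the resolution."""
    w, h = img.size
    views = [img]
    for i in (0, 1):
        for j in (0, 1):
            views.append(img.crop((i * w // 2, j * h // 2,
                                   (i + 1) * w // 2, (j + 1) * h // 2)))
    return views

inputs = processor(text=attributes, images=multi_resolution_views(image),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (num_views, num_attributes)

print(logits.max(dim=0).values)  # best view per attribute
```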
arXiv Detail & Related papers (2024-12-18T15:28:08Z)
- Quantifying and Enabling the Interpretability of CLIP-like Models [19.459369149558405]
We conduct this study on six different CLIP models from OpenAI and OpenCLIP.
Our approach begins with using the TEXTSPAN algorithm and in-context learning to break down individual attention heads into specific properties.
Our findings reveal that larger CLIP models are generally more interpretable than their smaller counterparts.
arXiv Detail & Related papers (2024-09-10T15:19:40Z)
- C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection [98.34703790782254]
We introduce Category Common Prompt CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder.
Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing.
arXiv Detail & Related papers (2024-08-19T02:14:25Z)
- Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification [13.090873217313732]
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID).
We first analyze the role of prompt learning in CLIP-ReID and identify its limitations.
Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning.
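For concreteness, here is a minimal sketch of a prototypical contrastive loss of the kind described above, assuming L2-normalized CLIP image embeddings and one prototype per identity; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(features: torch.Tensor,
                                  labels: torch.Tensor,
                                  prototypes: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Pull each image embedding toward its identity prototype and push it away
    from the prototypes of all other identities.

    features:   (B, D) L2-normalized image embeddings from CLIP's image encoder
    labels:     (B,)   identity indices in [0, K)
    prototypes: (K, D) L2-normalized per-identity prototypes (e.g. running means)
    """
    logits = features @ prototypes.t() / temperature  # (B, K) scaled cosine similarities
    return F.cross_entropy(logits, labels)

# Usage with random tensors, just to show shapes:
feats = F.normalize(torch.randn(8, 512), dim=-1)
protos = F.normalize(torch.randn(100, 512), dim=-1)
ids = torch.randint(0, 100, (8,))
loss = prototypical_contrastive_loss(feats, ids, protos)
```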
arXiv Detail & Related papers (2023-10-26T08:12:53Z)
- Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization [63.61093388441298]
Contrastive Language-Image Pre-training models have shown promising performance on zero-shot visual recognition tasks.
In this work, we propose AnoCLIP for zero-shot anomaly localization.
arXiv Detail & Related papers (2023-08-30T10:35:36Z)
- DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
arXiv Detail & Related papers (2023-05-30T15:13:17Z)
- Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning [77.7070536959126]
In-context learning (ICL) emerges as a promising capability of large language models (LLMs).
In this paper, we investigate the working mechanism of ICL through an information flow lens.
We introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL.
arXiv Detail & Related papers (2023-05-23T15:26:20Z)
- HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models [30.279621764192843]
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions.
Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors.
We propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization.
arXiv Detail & Related papers (2023-03-28T07:54:54Z)
- CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue.
We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
arXiv Detail & Related papers (2023-03-06T09:17:47Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
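As a rough approximation of the zero-shot dense-labeling idea (DenseCLIP itself reworks the final attention pooling; this shortcut simply projects patch tokens into the joint space), one can match per-patch features against class-name embeddings. The class prompts and image path below are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

class_prompts = ["a photo of a dog", "a photo of grass", "a photo of the sky"]
image = Image.open("scene.jpg")  # placeholder image
inputs = processor(text=class_prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patch_tokens = vision_out.last_hidden_state[:, 1:]               # drop the CLS token
    patch_feats = model.visual_projection(
        model.vision_model.post_layernorm(patch_tokens))             # (1, 196, D)
    text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])  # (C, D)

patch_feats = patch_feats / patch_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
pseudo_labels = (patch_feats @ text_feats.t()).argmax(-1)            # class index per patch
print(pseudo_labels[0].reshape(14, 14))                              # 224/16 = 14x14 patch grid
```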
arXiv Detail & Related papers (2021-12-02T09:23:01Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
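For readers unfamiliar with the set-prediction formulation, here is a toy, single-image version of the matching-based loss; real DETR also assigns a "no object" class to unmatched queries and adds a generalized IoU term.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def set_prediction_loss(pred_logits, pred_boxes, gt_classes, gt_boxes, box_weight=5.0):
    """Toy DETR-style loss for one image: Hungarian-match the N predictions to the
    M ground-truth objects, then apply classification and L1 box losses to the pairs.

    pred_logits: (N, num_classes)  pred_boxes: (N, 4)
    gt_classes:  (M,)              gt_boxes:   (M, 4), normalized cxcywh
    """
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_classes]                    # (N, M): higher prob -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 box distance
    cost = (cost_class + box_weight * cost_bbox).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)       # optimal one-to-one assignment

    cls_loss = F.cross_entropy(pred_logits[pred_idx], gt_classes[gt_idx])
    box_loss = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return cls_loss + box_weight * box_loss

# Shapes only: 100 object queries, 3 ground-truth boxes, 91 classes
loss = set_prediction_loss(torch.randn(100, 91), torch.rand(100, 4),
                           torch.tensor([3, 17, 42]), torch.rand(3, 4))
```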
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.