Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations
- URL: http://arxiv.org/abs/2311.17938v1
- Date: Tue, 28 Nov 2023 19:24:07 GMT
- Title: Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations
- Authors: Lei Fan, Jianxiong Zhou, Xiaoying Xing and Ying Wu
- Abstract summary: We introduce a novel agent for active open-vocabulary recognition.
The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features, without relying on class-specific knowledge.
- Score: 9.444540281544715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active recognition, which allows intelligent agents to explore observations
for better recognition performance, serves as a prerequisite for various
embodied AI tasks, such as grasping, navigation and room arrangements. Given
the evolving environment and the multitude of object classes, it is impractical
to include all possible classes during the training stage. In this paper, we
aim at advancing active open-vocabulary recognition, empowering embodied agents
to actively perceive and classify arbitrary objects. However, directly adopting
recent open-vocabulary classification models, such as Contrastive Language-Image Pre-training (CLIP), poses unique challenges. Specifically, we observe that
CLIP's performance is heavily affected by the viewpoint and occlusions,
compromising its reliability in unconstrained embodied perception scenarios.
Further, the sequential nature of observations in agent-environment
interactions necessitates an effective method for integrating features that
maintains discriminative strength for open-vocabulary classification. To
address these issues, we introduce a novel agent for active open-vocabulary
recognition. The proposed method leverages inter-frame and inter-concept
similarities to navigate agent movements and to fuse features, without relying
on class-specific knowledge. Compared to the baseline CLIP model, which reaches 29.6% accuracy on the ShapeNet dataset, the proposed agent achieves 53.3% accuracy for open-vocabulary recognition without any fine-tuning of the equipped CLIP model. Additional experiments conducted with the Habitat simulator further
affirm the efficacy of our method.
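To make the setting concrete, the sketch below shows a minimal, hypothetical pipeline in the same spirit: a frozen CLIP model scores a sequence of views against an arbitrary label vocabulary, and per-view image features are fused with weights derived from inter-frame similarity so that views agreeing with the rest of the trajectory contribute more. The model names, prompt template, and fusion rule are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch (assumptions: Hugging Face transformers CLIP, a simple
# similarity-weighted fusion rule standing in for the paper's scheme).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def fused_open_vocab_logits(frames: list[Image.Image], vocabulary: list[str]) -> torch.Tensor:
    """Classify a sequence of views against an arbitrary label vocabulary."""
    with torch.no_grad():
        img_inputs = processor(images=frames, return_tensors="pt")
        txt_inputs = processor(text=[f"a photo of a {c}" for c in vocabulary],
                               return_tensors="pt", padding=True)
        img_feats = model.get_image_features(**img_inputs)   # (V, D) one row per view
        txt_feats = model.get_text_features(**txt_inputs)    # (C, D) one row per class
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)

        # Inter-frame similarity: weight each view by its mean agreement with
        # the other views (hypothetical fusion rule, for illustration only).
        sim = img_feats @ img_feats.T                         # (V, V)
        weights = torch.softmax(sim.mean(dim=1), dim=0)       # (V,)
        fused = (weights[:, None] * img_feats).sum(dim=0)
        fused = fused / fused.norm()

        return fused @ txt_feats.T                            # (C,) cosine similarities
```

In the paper, inter-frame and inter-concept signals also drive a movement policy that decides where the agent looks next; the sketch above covers only the feature fusion and open-vocabulary classification steps.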
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Is CLIP the main roadblock for fine-grained open-world perception? [7.190567053576658]
Recent studies have highlighted limitations in fine-grained recognition capabilities in open-vocabulary settings.
We show that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space.
Our experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts (a hedged sketch of this re-projection idea appears after this list).
arXiv Detail & Related papers (2024-04-04T15:47:30Z)
- FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition [30.15770881713811]
We introduce FROSTER, an effective framework for open-vocabulary action recognition.
Applying CLIP directly to the action recognition task is challenging due to the absence of temporal information in CLIP's pretraining.
We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings.
arXiv Detail & Related papers (2024-02-05T17:56:41Z)
- Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception [21.639429724987902]
Active recognition enables robots to explore novel observations, thereby acquiring more information while circumventing undesired viewing conditions.
Most recognition modules are developed under the closed-world assumption, which makes them ill-equipped to handle unexpected inputs, such as the absence of the target object in the current observation.
We propose treating active recognition as a sequential evidence-gathering process, providing step-by-step uncertainty estimates and reliable predictions under evidence combination theory.
arXiv Detail & Related papers (2023-11-23T03:51:46Z)
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
- Incremental Object Detection with CLIP [36.478530086163744]
We propose using a visual-language model, such as CLIP, to generate text feature embeddings for different class sets.
We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario.
We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance.
arXiv Detail & Related papers (2023-10-13T01:59:39Z)
- Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z)
- OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
arXiv Detail & Related papers (2023-01-23T15:59:29Z)
- Open-set Adversarial Defense with Clean-Adversarial Mutual Learning [93.25058425356694]
This paper demonstrates that open-set recognition systems are vulnerable to adversarial samples.
Motivated by these observations, we emphasize the necessity of an Open-Set Adversarial Defense (OSAD) mechanism.
This paper proposes an Open-Set Defense Network with Clean-Adversarial Mutual Learning (OSDN-CAML) as a solution to the OSAD problem.
arXiv Detail & Related papers (2022-02-12T02:13:55Z)
- MCDAL: Maximum Classifier Discrepancy for Active Learning [74.73133545019877]
Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition.
In this paper, we propose a novel active learning framework called Maximum Classifier Discrepancy for Active Learning (MCDAL).
In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them.
arXiv Detail & Related papers (2021-07-23T06:57:08Z)
- Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention to mitigate data scarcity.
We present a discriminative nearest neighbor classification with deep self-attention.
We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z)
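Related to the "Is CLIP the main roadblock for fine-grained open-world perception?" entry above, the following is a minimal, hypothetical sketch of a CLIP latent-space re-projection: a small linear map fit on top of frozen CLIP features so that fine-grained categories become more separable. The dimensions, temperature, and training loop are illustrative assumptions, not the referenced paper's implementation.

```python
import torch
import torch.nn as nn

class CLIPReprojection(nn.Module):
    """Linear re-projection applied to frozen, L2-normalized CLIP features."""
    def __init__(self, clip_dim: int = 512, proj_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(clip_dim, proj_dim, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        z = self.proj(feats)
        return z / z.norm(dim=-1, keepdim=True)

def fit_reprojection(img_feats: torch.Tensor,   # (N, D) frozen CLIP image features
                     txt_feats: torch.Tensor,   # (C, D) frozen CLIP text features
                     labels: torch.Tensor,      # (N,) class indices into the C texts
                     epochs: int = 50, lr: float = 1e-3) -> CLIPReprojection:
    """Fit the map so each image feature moves toward its class text embedding."""
    model = CLIPReprojection(clip_dim=img_feats.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        # Both modalities pass through the same re-projection before matching.
        logits = 100.0 * model(img_feats) @ model(txt_feats).T  # fixed temperature
        loss = nn.functional.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```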