CLAMP: Prompt-based Contrastive Learning for Connecting Language and
Animal Pose
- URL: http://arxiv.org/abs/2206.11752v3
- Date: Mon, 26 Jun 2023 00:46:10 GMT
- Title: CLAMP: Prompt-based Contrastive Learning for Connecting Language and
Animal Pose
- Authors: Xu Zhang, Wen Wang, Zhe Chen, Yufei Xu, Jing Zhang, Dacheng Tao
- Abstract summary: We introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose (CLAMP).
The CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training.
Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings.
- Score: 70.59906971581192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Animal pose estimation is challenging for existing image-based methods
because of limited training data and large intra- and inter-species variances.
Motivated by the progress of visual-language research, we propose that
pre-trained language models (e.g., CLIP) can facilitate animal pose estimation
by providing rich prior knowledge for describing animal keypoints in text.
However, we found that building effective connections between pre-trained
language models and visual animal keypoints is non-trivial since the gap
between text-based descriptions and keypoint-based visual features about animal
pose can be significant. To address this issue, we introduce a novel
prompt-based Contrastive learning scheme for connecting Language and AniMal
Pose (CLAMP) effectively. The CLAMP attempts to bridge the gap by adapting the
text prompts to the animal keypoints during network training. The adaptation is
decomposed into spatial-aware and feature-aware processes, and two novel
contrastive losses are devised correspondingly. In practice, the CLAMP enables
the first cross-modal animal pose estimation paradigm. Experimental results
show that our method achieves state-of-the-art performance under the
supervised, few-shot, and zero-shot settings, outperforming image-based methods
by a large margin.
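The abstract describes the mechanism only at a high level: keypoint-specific text prompts are encoded with a pre-trained language model (e.g., CLIP) and aligned with keypoint-level visual features through contrastive losses, decomposed into spatial-aware and feature-aware terms. As a rough illustration of the feature-level alignment only, the following is a minimal sketch assuming per-keypoint visual features and CLIP-style prompt embeddings; the function name, shapes, and temperature are placeholders, and the paper's exact losses and prompt-adaptation modules are not reproduced.
```python
# Minimal sketch of a feature-level prompt-keypoint contrastive loss in the
# spirit of CLAMP. Assumes per-keypoint visual features and CLIP-style text
# embeddings of keypoint prompts; the actual CLAMP losses and prompt-adaptation
# modules differ in detail.
import torch
import torch.nn.functional as F

def prompt_keypoint_contrastive_loss(kpt_feats, prompt_feats, temperature=0.07):
    """kpt_feats: (K, D) visual features pooled at K keypoint locations.
    prompt_feats: (K, D) text embeddings of the K keypoint prompts."""
    kpt = F.normalize(kpt_feats, dim=-1)
    txt = F.normalize(prompt_feats, dim=-1)
    logits = kpt @ txt.t() / temperature               # (K, K) similarity matrix
    targets = torch.arange(kpt.size(0), device=kpt.device)
    # Symmetric InfoNCE: each keypoint feature should best match its own prompt,
    # and each prompt should best match its own keypoint feature.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 17 keypoints with 512-dimensional embeddings.
loss = prompt_keypoint_contrastive_loss(torch.randn(17, 512), torch.randn(17, 512))
```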
Related papers
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z)
- UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning [24.157933537030086]
We introduce UniAP, a novel Universal Animal Perception model that enables cross-species perception among various visual tasks.
By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species.
arXiv Detail & Related papers (2023-08-19T09:13:46Z)
- LAMP: Leveraging Language Prompts for Multi-person Pose Estimation [8.983326069321981]
We propose a novel prompt-based pose inference strategy called LAMP (Language Assisted Multi-person Pose estimation).
By utilizing the text representations generated by a well-trained language model (CLIP), LAMP can facilitate the understanding of poses on the instance and joint levels.
This paper demonstrates that language-supervised training boosts the performance of single-stage multi-person pose estimation.
arXiv Detail & Related papers (2023-07-21T23:00:43Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
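The summary names three training signals but not how they are combined; one plausible reading is a simple weighted sum, sketched below. The distillation term (a softened KL against frozen CLIP zero-shot logits) and all names and weights are assumptions for illustration, not the paper's exact formulation.
```python
# Illustrative combination of the three SgVA-style objectives as a weighted sum.
# The distillation form and the weights are assumptions, not the paper's own.
import torch
import torch.nn.functional as F

def sgva_style_total_loss(l_vision, l_cross_modal, adapter_logits, clip_logits,
                          w_distill=1.0, tau=2.0):
    # Hypothetical "implicit knowledge distillation": soften both distributions
    # and penalise divergence of the adapted classifier from frozen CLIP.
    l_distill = F.kl_div(F.log_softmax(adapter_logits / tau, dim=-1),
                         F.softmax(clip_logits / tau, dim=-1),
                         reduction="batchmean") * tau * tau
    return l_vision + l_cross_modal + w_distill * l_distill

# Toy usage with random logits over 5 classes for a batch of 8 samples.
total = sgva_style_total_loss(torch.tensor(0.4), torch.tensor(0.6),
                              torch.randn(8, 5), torch.randn(8, 5))
```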
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- A Transformer-Based Contrastive Learning Approach for Few-Shot Sign Language Recognition [0.0]
We propose a novel Contrastive Transformer-based model, which is shown to learn rich representations from body keypoint sequences.
Experiments showed that the model could generalize well and achieved competitive results for sign classes never seen in the training process.
arXiv Detail & Related papers (2022-04-05T11:42:55Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
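Pixel-text matching can be read as scoring every spatial location of a dense visual feature map against every class-prompt embedding; a minimal sketch under that reading follows (shapes and names are illustrative, and DenseCLIP's context-aware prompting is omitted).
```python
# Minimal sketch of pixel-text score maps in the spirit of DenseCLIP: score each
# spatial location of a visual feature map against each class-prompt embedding.
import torch
import torch.nn.functional as F

def pixel_text_score_maps(feat_map, text_emb, temperature=0.07):
    """feat_map: (B, D, H, W) dense visual features; text_emb: (C, D) prompt
    embeddings for C classes. Returns (B, C, H, W) score maps."""
    feat = F.normalize(feat_map, dim=1)     # normalize along the channel axis
    txt = F.normalize(text_emb, dim=-1)
    return torch.einsum("bdhw,cd->bchw", feat, txt) / temperature

# The score maps can then guide dense prediction, e.g. via a per-pixel
# cross-entropy against segmentation labels of shape (B, H, W).
maps = pixel_text_score_maps(torch.randn(2, 512, 32, 32), torch.randn(21, 512))
```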
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
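A common form of cross-modal late interaction, which appears to be what the summary refers to, computes token-to-token similarities after encoding each modality, takes each token's best match in the other modality, and averages. The sketch below uses illustrative names and shapes and omits padding masks and batching.
```python
# Sketch of a token-level late-interaction similarity in the spirit of FILIP:
# each token is matched to its most similar token in the other modality, and the
# matches are averaged per modality.
import torch
import torch.nn.functional as F

def late_interaction_similarity(img_tokens, txt_tokens):
    """img_tokens: (N, D) patch embeddings; txt_tokens: (M, D) word embeddings.
    Returns a scalar image-text similarity."""
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    sim = img @ txt.t()                    # (N, M) token-to-token similarities
    i2t = sim.max(dim=1).values.mean()     # each patch matched to its best word
    t2i = sim.max(dim=0).values.mean()     # each word matched to its best patch
    return 0.5 * (i2t + t2i)

score = late_interaction_similarity(torch.randn(49, 512), torch.randn(12, 512))
```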
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Ontology-guided Semantic Composition for Zero-Shot Learning [36.84707487983917]
We propose to model the compositional and expressive semantics of class labels by an OWL (Web Ontology Language) ontology.
The effectiveness has been verified by preliminary experiments on animal image classification and visual question answering.
arXiv Detail & Related papers (2020-06-30T15:49:12Z)