OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection
- URL: http://arxiv.org/abs/2409.19899v1
- Date: Mon, 30 Sep 2024 02:58:05 GMT
- Title: OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection
- Authors: Changsheng Lu, Zheyuan Liu, Piotr Koniusz
- Abstract summary: We open the prompt diversity from three aspects: modality, semantics (seen vs. unseen), and language.
We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting.
- Score: 35.57926269889791
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Exploiting foundation models (e.g., CLIP) to build a versatile keypoint detector has gained increasing attention. Most existing models accept either a text prompt (e.g., ``the nose of a cat'') or a visual prompt (e.g., a support image with keypoint annotations) to detect the corresponding keypoints in the query image, thereby exhibiting either zero-shot or few-shot detection ability. However, taking multimodal prompts is still underexplored, and prompt diversity in semantics and language is far from fully opened. For example, how should unseen text prompts for novel keypoint detection, or diverse text prompts like ``Can you detect the nose and ears of a cat?'', be handled? In this work, we open the prompt diversity from three aspects: modality, semantics (seen vs. unseen), and language, to enable more generalized zero- and few-shot keypoint detection (Z-FSKD). We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting. Further, to infer the keypoint locations of unseen texts, we add auxiliary keypoints and texts interpolated from the visual and textual domains into training, which improves the spatial reasoning of our model and significantly enhances zero-shot novel keypoint detection. We also find that a large language model (LLM) is a good parser, achieving over 96% accuracy in parsing keypoints from texts. With the LLM, OpenKD can handle diverse text prompts. Experimental results show that our method achieves state-of-the-art performance on Z-FSKD and initiates new ways to deal with unseen and diverse texts. The source code and data are available at https://github.com/AlanLuSun/OpenKD.
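The abstract notes that an LLM serves as a good parser for extracting keypoints from diverse text prompts. Below is a minimal, hypothetical Python sketch of that idea; the prompt template, JSON schema, and function names are illustrative assumptions rather than the authors' released implementation (see the GitHub repository above for the actual code).

```python
# Illustrative sketch (not the authors' code): using an LLM to parse keypoint
# names from a free-form request such as "Can you detect the nose and ears of
# a cat?". The prompt template and helper names are assumptions.
import json
import re


def build_parser_prompt(user_text: str) -> str:
    """Instruct the LLM to extract the species and keypoint names as JSON."""
    return (
        "Extract the animal species and the keypoint names mentioned in the "
        "request below. Reply only with JSON of the form "
        '{"species": "...", "keypoints": ["..."]}.\n'
        f"Request: {user_text}"
    )


def parse_keypoints(user_text: str, llm_call) -> dict:
    """Call an LLM (any text-in/text-out callable) and parse its JSON reply."""
    reply = llm_call(build_parser_prompt(user_text))
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate surrounding prose
    return json.loads(match.group(0)) if match else {"species": None, "keypoints": []}


# Mock LLM showing the intended behaviour on the abstract's example prompt.
def mock_llm(prompt: str) -> str:
    return '{"species": "cat", "keypoints": ["nose", "left ear", "right ear"]}'


if __name__ == "__main__":
    result = parse_keypoints("Can you detect the nose and ears of a cat?", mock_llm)
    # Each parsed keypoint name would then be matched against the multimodal
    # prototype set to localize the corresponding point in the query image.
    print(result["species"], result["keypoints"])
```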
Related papers
- KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension [31.283133365170052]
We introduce Semantic Keypoint comprehension, which aims to understand keypoints across different task scenarios.
We also introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy.
KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations.
arXiv Detail & Related papers (2024-11-04T06:42:24Z) - Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval [13.315951821189538]
Scene text retrieval aims to find all images containing the query text from an image gallery.
Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes.
We propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval.
arXiv Detail & Related papers (2024-08-01T10:25:14Z) - CountGD: Multi-Modal Open-World Counting [54.88804890463491]
This paper aims to improve the generality and accuracy of open-vocabulary object counting in images.
We introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description, visual exemplars, or both.
arXiv Detail & Related papers (2024-07-05T16:20:48Z) - TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions.
By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z) - X-Pose: Detecting Any Keypoints [28.274913140048003]
X-Pose is a novel framework for multi-object keypoint detection in images.
UniKPT is a large-scale dataset that unifies multiple keypoint detection datasets.
X-Pose achieves notable improvements over state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods.
arXiv Detail & Related papers (2023-10-12T17:22:58Z) - Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
The Open-Vocabulary Keypoint Detection (OVKD) task is designed to use text prompts for identifying arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM).
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z) - Towards Unified Scene Text Spotting based on Sequence Generation [4.437335677401287]
We propose a UNIfied scene Text Spotter, called UNITS.
Our model unifies various detection formats, including quadrilaterals and polygons.
We apply starting-point prompting to enable the model to extract texts from an arbitrary starting point.
arXiv Detail & Related papers (2023-04-07T01:28:08Z) - Open-Vocabulary Point-Cloud Object Detection without 3D Annotation [62.18197846270103]
The goal of open-vocabulary 3D point-cloud detection is to identify novel objects based on arbitrary textual descriptions.
We develop a point-cloud detector that can learn a general representation for localizing various objects.
We also propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text.
arXiv Detail & Related papers (2023-04-03T08:22:02Z) - AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting [98.08853679310603]
This work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter).
AE TextSpotter learns both visual and linguistic features to significantly reduce ambiguity in text detection.
To our knowledge, this is the first time text detection has been improved by using a language model.
arXiv Detail & Related papers (2020-08-03T08:40:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.