CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2411.06869v1
- Date: Mon, 11 Nov 2024 11:08:26 GMT
- Title: CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
- Authors: Junho Kim, Hyungjin Chung, Byung-Hoon Kim
- Abstract summary: Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints.
Recent efforts have begun exploring the use of text-based queries, which eliminate the need for support keypoints.
We introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE.
- Score: 18.121331575626023
- Abstract: Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, which eliminate the need for support keypoints. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method employs only the query image and detailed text descriptions as input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as the optimal keypoint descriptions, neural network architecture, and training strategy. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state-of-the-art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.
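Taken at face value, the abstract describes a simple inference interface: a query image plus a natural-language description of each keypoint goes into the MLLM, and coordinates come out. The sketch below illustrates only that interface; the prompt template, the per-keypoint querying, and the coordinate format are assumptions made for illustration, not CapeLLM's actual design.

```python
# Minimal sketch of text-query CAPE inference against a generic MLLM
# interface. The prompt template and "(x, y)" output format are
# illustrative assumptions, not CapeLLM's actual implementation.
import re
from typing import Callable

def estimate_keypoints(
    generate: Callable[[bytes, str], str],  # (image, prompt) -> model reply
    image: bytes,
    keypoint_descriptions: dict[str, str],
) -> dict[str, tuple[float, float]]:
    """Query the MLLM once per keypoint and parse '(x, y)' from its reply."""
    results = {}
    for name, description in keypoint_descriptions.items():
        prompt = (
            f"Locate the keypoint '{name}' in the image. "
            f"Description: {description} "
            "Answer with normalized coordinates as (x, y)."
        )
        match = re.search(r"\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)", generate(image, prompt))
        if match:
            results[name] = (float(match.group(1)), float(match.group(2)))
    return results
```

Even in this toy form the support-free property is visible: no annotated exemplar image appears anywhere, and the text description alone plays the role that support keypoints play in conventional CAPE.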
Related papers
- KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension [31.283133365170052]
We introduce Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios.
We also introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy.
KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations.
arXiv Detail & Related papers (2024-11-04T06:42:24Z)
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
- SCAPE: A Simple and Strong Category-Agnostic Pose Estimator [6.705257644513057]
Category-Agnostic Pose Estimation (CAPE) aims to localize keypoints on an object of any category given a few exemplars in an in-context manner.
We introduce two key modules: a global keypoint feature perceptor to inject global semantic information into support keypoints, and a keypoint attention refiner to enhance inter-node correlation between keypoints.
SCAPE outperforms prior art by 2.2 and 1.3 PCK under the 1-shot and 5-shot settings, respectively, with faster inference and a lighter model.
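The keypoint attention refiner described above boils down to attention among keypoint tokens, letting each keypoint's feature borrow context from the others. A minimal sketch, assuming a single off-the-shelf self-attention layer and illustrative dimensions (SCAPE's actual module may differ):

```python
# Inter-keypoint self-attention: each keypoint token attends to all
# others. Layer choice and dimensions are illustrative assumptions,
# not SCAPE's actual architecture.
import torch

refiner = torch.nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
keypoint_tokens = torch.randn(1, 17, 128)  # (batch, n_keypoints, feature_dim)
refined, attn_weights = refiner(keypoint_tokens, keypoint_tokens, keypoint_tokens)
```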
arXiv Detail & Related papers (2024-07-18T13:02:57Z)
- Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z)
- IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization [59.06663981902496]
Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization.
We investigate two indispensable characteristics that LLM-based QFS models should harness: Lengthy Document Summarization and Efficiently Fine-grained Query-LLM Alignment.
These innovations pave the way for broader application and accessibility in the field of QFS technology.
arXiv Detail & Related papers (2024-07-15T07:14:56Z)
- Meta-Point Learning and Refining for Category-Agnostic Pose Estimation [46.98479393474727]
Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary classes given a few support images annotated with keypoints.
We propose a novel framework for CAPE based on potential keypoints (named meta-points).
arXiv Detail & Related papers (2024-03-20T14:54:33Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
The Open-Vocabulary Keypoint Detection (OVKD) task is designed to use text prompts to identify arbitrary keypoints across any species.
We develop a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM).
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
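The "interplay between language features and local keypoint visual features" is at its core a matching step: score every spatial location of the visual feature map against the keypoint's text embedding and read off the peak. A hedged sketch of that step, with the encoders and shapes assumed for illustration:

```python
# Semantic-feature matching as a cosine-similarity heatmap whose peak
# gives the predicted keypoint location. Shapes and the (implied)
# encoders are illustrative assumptions, not KDSM's actual design.
import torch
import torch.nn.functional as F

def match_keypoint(visual_feats: torch.Tensor, text_embed: torch.Tensor) -> tuple[int, int]:
    """visual_feats: (C, H, W) local features; text_embed: (C,) prompt embedding."""
    c, h, w = visual_feats.shape
    feats = F.normalize(visual_feats.reshape(c, -1), dim=0)  # unit-norm per location
    query = F.normalize(text_embed, dim=0)
    heatmap = (query @ feats).reshape(h, w)                  # cosine similarity map
    idx = int(heatmap.argmax())
    return divmod(idx, w)                                    # (row, col) of the peak

# Random tensors stand in for real encoder outputs.
row, col = match_keypoint(torch.randn(256, 32, 32), torch.randn(256))
```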
arXiv Detail & Related papers (2023-10-08T07:42:41Z)
- LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
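The prototypical-network backbone that LPN builds on is standard: average each class's support embeddings into a prototype, then assign a query to its nearest prototype. A minimal sketch of just that step, with the language-guidance term omitted and all names illustrative:

```python
# Standard prototypical-network classification (the base LPN extends).
# The language-guidance component is intentionally omitted.
import torch

def classify_by_prototypes(support: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """support: (n_classes, n_shots, dim); query: (n_queries, dim)."""
    prototypes = support.mean(dim=1)        # one mean embedding per class
    dists = torch.cdist(query, prototypes)  # Euclidean distance to each prototype
    return dists.argmin(dim=1)              # nearest-prototype label per query

labels = classify_by_prototypes(torch.randn(5, 3, 64), torch.randn(8, 64))
```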
arXiv Detail & Related papers (2023-07-04T06:54:01Z)
- Activating the Discriminability of Novel Classes for Few-shot Segmentation [48.542627940781095]
We propose to activate the discriminability of novel classes explicitly in both the feature encoding stage and the prediction stage for segmentation.
In the prediction stage for segmentation, we learn a Self-Refined Online Foreground-Background classifier (SROFB), which is able to refine itself using the high-confidence pixels of the query image.
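The self-refinement idea is concrete enough to sketch: wherever the current foreground-background classifier is already confident on the query image, those pixels become pseudo-labels that update the classifier. The cosine-prototype classifier and threshold below are assumptions for illustration, not SROFB's exact formulation:

```python
# Online self-refinement: high-confidence query pixels serve as
# pseudo-labels to update foreground/background prototypes. The cosine
# classifier and the threshold tau are illustrative assumptions.
import torch
import torch.nn.functional as F

def refine_prototypes(feats, fg_proto, bg_proto, tau=0.7):
    """feats: (C, H, W) query features; fg_proto, bg_proto: (C,) prototypes."""
    flat = feats.reshape(feats.shape[0], -1).t()             # (HW, C)
    sims = torch.stack(
        [F.cosine_similarity(flat, bg_proto[None]),
         F.cosine_similarity(flat, fg_proto[None])], dim=1)  # (HW, 2)
    conf, pred = sims.softmax(dim=1).max(dim=1)              # per-pixel confidence
    fg_mask = (conf > tau) & (pred == 1)                     # confident foreground
    bg_mask = (conf > tau) & (pred == 0)                     # confident background
    if fg_mask.any():
        fg_proto = flat[fg_mask].mean(dim=0)                 # refine with pseudo-labels
    if bg_mask.any():
        bg_proto = flat[bg_mask].mean(dim=0)
    return fg_proto, bg_proto
```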
arXiv Detail & Related papers (2022-12-02T12:22:36Z)