Related papers: KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension

URL: http://arxiv.org/abs/2411.01846v1
Date: Mon, 04 Nov 2024 06:42:24 GMT
Title: KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension
Authors: Jie Yang, Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Ruimao Zhang
Abstract summary: We introduce Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios. We also introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy. KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations.
Score: 31.283133365170052
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have greatly improved their abilities in image understanding. However, these models often struggle with grasping pixel-level semantic details, e.g., the keypoints of an object. To bridge this gap, we introduce the novel challenge of Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios, including keypoint semantic understanding, visual prompt-based keypoint detection, and textual prompt-based keypoint detection. Moreover, we introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy to effectively address these challenges. KptLLM underscores the initial discernment of semantics in keypoints, followed by the precise determination of their positions through a chain-of-thought process. With several carefully designed modules, KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations. Our extensive experiments demonstrate KptLLM's superiority in various keypoint detection benchmarks and its unique semantic capabilities in interpreting keypoints.

Related papers

QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models [50.51641024244313]
We investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC). We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks.
arXiv Detail & Related papers (2025-11-05T05:49:48Z)
A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. We propose a novel ERU framework that jointly leverages data augmentation, depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z)
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning [83.68366772745689]
We propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos.
arXiv Detail & Related papers (2025-09-22T17:59:40Z)
KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model [31.59640895434506]
Keypoints, as structure-aware, pixel-level, and compact representations of objects, play a crucial role in applications such as fine-grained image analysis, object retrieval, and behavior recognition. In this paper, we propose KptLLM++, a novel multimodal large language model that is specifically designed for generic keypoint comprehension. By unifying keypoint detection across varied contexts, KptLLM++ establishes itself as an advanced interface, fostering more effective human-AI collaboration.
arXiv Detail & Related papers (2025-07-15T08:52:28Z)
VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT). VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks. We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models [18.121331575626023]
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints. Recent efforts have begun exploring the use of text-based queries, where the need for support keypoints is eliminated. We introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE.
arXiv Detail & Related papers (2024-11-11T11:08:26Z)
KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z)
Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation [3.976851945232775]
Current approaches for sign language recognition rely on RGB video inputs, which are vulnerable to fluctuations in the background. We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator. We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology.
arXiv Detail & Related papers (2024-05-09T10:58:37Z)
Meta-Point Learning and Refining for Category-Agnostic Pose Estimation [46.98479393474727]
Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary classes given a few support images annotated with keypoints. We propose a novel framework for CAPE based on such potential keypoints (named meta-points).
arXiv Detail & Related papers (2024-03-20T14:54:33Z)
Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
The Open-Vocabulary Keypoint Detection (OVKD) task is designed to use text prompts for identifying arbitrary keypoints across any species. We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM). This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
Keyphrase Extraction with Dynamic Graph Convolutional Networks and Diversified Inference [50.768682650658384]
Keyphrase extraction (KE) aims to summarize a set of phrases that accurately express a concept or a topic covered in a given document. The recent Sequence-to-Sequence (Seq2Seq) based generative framework is widely used in the KE task and has obtained competitive performance on various benchmarks. In this paper, we propose to adopt the Dynamic Graph Convolutional Networks (DGCN) to solve the above two problems simultaneously.
arXiv Detail & Related papers (2020-10-24T08:11:23Z)
MOPT: Multi-Object Panoptic Tracking [33.77171216778909]
We introduce a novel perception task denoted as multi-object panoptic tracking (MOPT). MOPT allows for exploiting pixel-level semantic information of 'thing' and 'stuff' classes, temporal coherence, and pixel-level associations over time. We present extensive quantitative and qualitative evaluations of both vision-based and LiDAR-based MOPT that demonstrate encouraging results.
arXiv Detail & Related papers (2020-04-17T11:45:28Z)
Towards High Performance Human Keypoint Detection [87.1034745775229]
We find that context information plays an important role in reasoning human body configuration and invisible keypoints. Inspired by this, we propose a cascaded context mixer (CCM) which efficiently integrates spatial and channel context information. To maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy. We present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy.
arXiv Detail & Related papers (2020-02-03T02:24:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.