LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning
- URL: http://arxiv.org/abs/2506.21317v1
- Date: Thu, 26 Jun 2025 14:32:56 GMT
- Title: LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning
- Authors: Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno
- Abstract summary: Current vision-language models (VLMs) are well-adapted for general visual understanding tasks but lack specialized vision-language instruction-following data for human poses and actions. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements.
- Score: 1.820765907065129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.
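To make the data-generation recipe above concrete, the sketch below shows one plausible way to serialize COCO-style captions, person bounding boxes, and 17-point human keypoints into a textual context that a text-only language model could turn into instruction-following samples (conversation, detailed description, complex reasoning). The function names, prompt wording, and annotation values are illustrative assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch of the keypoint-integrated data-generation step: symbolic
# annotations (captions, boxes, keypoints) are flattened into plain text so a
# text-only LLM can write pose-aware instruction-following samples.

COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def build_symbolic_context(captions, people):
    """Serialize captions, person boxes, and visible keypoints into plain text."""
    lines = [f"Caption: {c}" for c in captions]
    for i, person in enumerate(people):
        x, y, w, h = person["bbox"]
        lines.append(f"Person {i}: bbox = ({x:.1f}, {y:.1f}, {w:.1f}, {h:.1f})")
        kps = person["keypoints"]  # flat COCO list: [x1, y1, v1, x2, y2, v2, ...]
        for j, name in enumerate(COCO_KEYPOINT_NAMES):
            kx, ky, vis = kps[3 * j: 3 * j + 3]
            if vis > 0:  # keep only labeled keypoints
                lines.append(f"  {name}: ({kx:.1f}, {ky:.1f})")
    return "\n".join(lines)

# Example annotation for one image (values are made up for illustration).
captions = ["A man is swinging a tennis racket on a court."]
people = [{
    "bbox": [120.0, 60.0, 180.0, 320.0],
    "keypoints": [210, 90, 2, 205, 85, 2, 215, 85, 2] + [0, 0, 0] * 14,
}]

context = build_symbolic_context(captions, people)
prompt = (
    "Using only the annotations below, write a question-and-answer pair "
    "about the person's pose and action.\n\n" + context
)
print(prompt)  # this prompt would then be sent to a text-only LLM
```

In a setup like this, the generator never sees the image itself, only its symbolic annotations; the added keypoint coordinates are what let a text-only model write questions and answers that reference body configuration rather than just objects and captions.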
Related papers
- Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling [0.0]
Current approaches to visual question answering often struggle with the precision required for scientific data interpretation. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model's ability in visual question answering.
arXiv Detail & Related papers (2025-07-08T17:05:42Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a unified vision-language model built on semantic discrete encoding. Our method improves understanding performance by 4.8% compared to the previous SOTA Emu3 and surpasses the dedicated understanding model LLaVA-NeXT 34B by 3.7%.
arXiv Detail & Related papers (2024-11-26T03:33:52Z) - VAGUE: Visual Contexts Clarify Ambiguous Expressions [15.140825578254324]
We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent.
arXiv Detail & Related papers (2024-11-21T14:01:42Z) - Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models [1.9890559505377343]
Current vision-language multimodal models are well-adapted for general visual understanding tasks but lack instruction-following data for human poses and actions. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate it on the benchmark, achieving significant improvements.
arXiv Detail & Related papers (2024-09-14T05:07:57Z) - Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning [29.037799937729687]
2D and 3D Human Pose Estimation (HPE) are two critical perceptual tasks in computer vision.
We propose UniHPE, a unified Human Pose Estimation pipeline, which aligns features from all three modalities.
Our proposed method holds immense potential to advance the field of computer vision and contribute to various applications.
arXiv Detail & Related papers (2023-11-24T21:55:34Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - StyleGAN-Human: A Data-Centric Odyssey of Human Generation [96.7080874757475]
This work takes a data-centric perspective and investigates multiple critical aspects of "data engineering".
We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures.
We rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
arXiv Detail & Related papers (2022-04-25T17:55:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.