ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
- URL: http://arxiv.org/abs/2504.18152v1
- Date: Fri, 25 Apr 2025 08:05:32 GMT
- Title: ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
- Authors: Yi-Xing Peng, Qize Yang, Yu-Ming Tang, Shenghao Fu, Kun-Yu Lin, Xihan Wei, Wei-Shi Zheng
- Abstract summary: ActionArt is a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios. We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions.
- Score: 31.481969919049472
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each accompanied by detailed annotations that meticulously label every limb movement. We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions. Experimental results indicate that, while current large multimodal models perform commendably on various tasks, they often fall short in achieving fine-grained understanding. We attribute this limitation to the scarcity of meticulously annotated data, which is both costly and difficult to scale manually. To address this, we propose proxy tasks to enhance the models' perception abilities in both spatial and temporal dimensions. These proxy tasks are carefully crafted to be driven by data automatically generated from existing MLLMs, thereby reducing the reliance on costly manual labels. Experimental results show that the proposed proxy tasks significantly narrow the gap toward the performance achieved with manually annotated fine-grained data.
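To make the proxy-task idea more concrete, below is a minimal, hypothetical sketch of how automatically generated per-clip captions could be turned into a temporal-ordering question for training temporal perception. The function name `build_ordering_sample`, the multiple-choice format, and the caption source are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical illustration only: construct a temporal-ordering proxy-task
# sample from automatically generated per-clip captions. ActionArt's actual
# proxy tasks may be defined differently.
import math
import random

def build_ordering_sample(clip_captions, num_options=4, seed=None):
    """Turn an ordered list of clip captions into a multiple-choice
    question asking for the correct temporal order of the clips."""
    rng = random.Random(seed)
    correct = tuple(range(len(clip_captions)))

    # Cap the number of choices by the number of possible orderings.
    num_options = min(num_options, math.factorial(len(clip_captions)))

    # Collect distinct orderings, always including the correct one.
    options = {correct}
    while len(options) < num_options:
        shuffled = list(correct)
        rng.shuffle(shuffled)
        options.add(tuple(shuffled))

    options = list(options)
    rng.shuffle(options)
    answer_letter = chr(ord("A") + options.index(correct))

    choices = "\n".join(
        f"({chr(ord('A') + i)}) " + " -> ".join(clip_captions[j] for j in order)
        for i, order in enumerate(options)
    )
    question = "Arrange the described moments in temporal order:\n" + choices
    return {"question": question, "answer": answer_letter}

if __name__ == "__main__":
    captions = ["raises right arm", "bends both knees", "jumps forward"]
    print(build_ordering_sample(captions, seed=0))
```

Samples of this kind need no manual labels, since the correct order is known from the source video itself, which is the general appeal of proxy tasks driven by automatically generated data.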
Related papers
- Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning [71.02843679746563]
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. We propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information.
arXiv Detail & Related papers (2025-03-02T18:49:48Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models [1.9890559505377343]
We introduce a new method for generating such data by integrating human keypoints with traditional visual features like captions and bounding boxes.
Our approach produces datasets designed for fine-tuning models to excel in human-centric activities.
Experimental results show an overall improvement of 21.18% compared to the original LLaVA-7B model.
arXiv Detail & Related papers (2024-09-14T05:07:57Z) - Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data.
Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks.
We release our pre-trained models as well as source code publicly.
arXiv Detail & Related papers (2024-04-16T20:51:36Z) - Masked Diffusion with Task-awareness for Procedure Planning in
Instructional Videos [16.93979476655776]
A key challenge with procedure planning in instructional videos is how to handle a large decision space consisting of a multitude of action types.
We introduce a simple yet effective enhancement - a masked diffusion model.
We learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
arXiv Detail & Related papers (2023-09-14T03:25:37Z) - Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, and action recognition.
arXiv Detail & Related papers (2023-07-26T08:40:46Z) - Human Action Recognition Based on Multi-scale Feature Maps from Depth Video Sequences [12.30399970340689]
We present a novel framework focusing on multi-scale motion information to recognize human actions from depth video sequences.
We employ depth motion images (DMI) as the templates to generate the multi-scale static representation of actions.
We extract the multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features.
arXiv Detail & Related papers (2021-01-19T13:46:42Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z) - The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z) - Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events [106.19047816743988]
We present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve.
It contains a record number of poses (>1M), the largest number of action instances (>56k) under complex events, and one of the largest collections of long-duration trajectories.
Based on its diverse annotation, we present two simple baselines for action recognition and pose estimation.
arXiv Detail & Related papers (2020-05-09T18:24:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.