Hulk: A Universal Knowledge Translator for Human-Centric Tasks
- URL: http://arxiv.org/abs/2312.01697v4
- Date: Fri, 22 Mar 2024 02:47:00 GMT
- Title: Hulk: A Universal Knowledge Translator for Human-Centric Tasks
- Authors: Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang
- Abstract summary: We present Hulk, the first multimodal human-centric generalist model.
It addresses 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning.
Hulk achieves state-of-the-art performance in 11 benchmarks.
- Score: 69.8518392427151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D or vision-language tasks in the human-centric domain and still required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance on 11 benchmarks. The code is available at https://github.com/OpenGVLab/Hulk.
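The abstract's central design, collapsing all task-specific heads into one discrete head and one continuous head and treating every task as modality translation, can be sketched minimally as follows. This is an illustrative PyTorch sketch under assumed module names, dimensions, and routing logic, not the released Hulk implementation (see the repository linked above).

```python
import torch
import torch.nn as nn

class TwoHeadDecoder(nn.Module):
    """Hypothetical sketch of the two-general-head idea from the abstract:
    one head emits discrete representations (e.g., word tokens, semantic
    labels), the other emits continuous representations (e.g., location
    coordinates). Names and dimensions are illustrative assumptions."""

    def __init__(self, hidden_dim: int = 768, vocab_size: int = 30522, coord_dim: int = 3):
        super().__init__()
        # Discrete head: projects shared features onto a token vocabulary.
        self.discrete_head = nn.Linear(hidden_dim, vocab_size)
        # Continuous head: regresses real-valued quantities such as 2D/3D coordinates.
        self.continuous_head = nn.Linear(hidden_dim, coord_dim)

    def forward(self, features: torch.Tensor, output_modality: str) -> torch.Tensor:
        # Diverse human-centric tasks are framed as modality translation:
        # the same shared features are routed to whichever output modality
        # the target task requires.
        if output_modality in ("text", "semantic_label"):
            return self.discrete_head(features)      # logits over tokens
        if output_modality in ("coordinates", "keypoints"):
            return self.continuous_head(features)    # real-valued outputs
        raise ValueError(f"Unknown output modality: {output_modality}")

if __name__ == "__main__":
    decoder = TwoHeadDecoder()
    shared_features = torch.randn(2, 16, 768)          # (batch, tokens, hidden)
    caption_logits = decoder(shared_features, "text")        # discrete output
    keypoints = decoder(shared_features, "keypoints")        # continuous output
    print(caption_logits.shape, keypoints.shape)
```

The point of the sketch is only the routing pattern: one shared backbone representation, two generic output heads, and the task determines which head (output modality) is used, rather than each task owning its own specialized head.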
Related papers
- HumanVLM: Foundation for Human-Scene Vision-Language Model [3.583459930633303]
Human-scene vision-language tasks are increasingly prevalent in diverse social applications.
This study introduces a domain-specific Large Vision-Language Model, the Human-Scene Vision-Language Model (HumanVLM).
In the experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance.
arXiv Detail & Related papers (2024-11-05T12:14:57Z) - A Unified Framework for Human-centric Point Cloud Video Understanding [23.91555808792291]
Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds.
We propose a unified framework that makes full use of prior knowledge and explores the inherent features of the data itself for generalized human-centric point cloud video understanding.
Our method achieves state-of-the-art performance on various human-related tasks, including action recognition and 3D pose estimation.
arXiv Detail & Related papers (2024-03-29T07:53:06Z) - CapHuman: Capture Your Moments in Parallel Universes [60.06408546134581]
We present a new framework named CapHuman.
CapHuman encodes identity features and then learns to align them into the latent space.
We introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.
arXiv Detail & Related papers (2024-02-01T14:41:59Z) - EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI [88.03089807278188]
EmbodiedScan is a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding.
It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories.
Building upon this database, we introduce a baseline framework named Embodied Perceptron.
It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities.
arXiv Detail & Related papers (2023-12-26T18:59:11Z) - You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception [37.667147915777534]
Human-centric perception is a long-standing problem for computer vision.
This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP).
Human Query captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios.
arXiv Detail & Related papers (2023-12-09T10:36:43Z) - Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc.
arXiv Detail & Related papers (2023-07-26T08:40:46Z) - HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining [75.1086193340286]
It is desirable to have a general pretrained model for versatile human-centric downstream tasks.
We propose HumanBench, built on existing datasets, to evaluate the generalization abilities of different pretraining methods on a common ground.
Our projector-assisted pretraining method, PATH, achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets.
arXiv Detail & Related papers (2023-03-10T02:57:07Z) - UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP).
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines by direct evaluation.
arXiv Detail & Related papers (2023-03-06T07:10:07Z)