You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
- URL: http://arxiv.org/abs/2312.05525v3
- Date: Sun, 14 Jul 2024 09:33:09 GMT
- Title: You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
- Authors: Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo
- Abstract summary: Human-centric perception is a long-standing problem for computer vision.
This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP).
Human Query captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios.
- Score: 37.667147915777534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-centric perception (e.g. detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem for computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Our approach centers on learning a unified human query representation, denoted as Human Query, which captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios. Although different HCP tasks have been well-studied individually, single-stage multi-task learning of HCP tasks has not been fully exploited in the literature due to the absence of a comprehensive benchmark dataset. To address this gap, we propose COCO-UniHuman benchmark to enable model development and comprehensive evaluation. Experimental results demonstrate the proposed method's state-of-the-art performance among multi-task HCP models and its competitive performance compared to task-specific HCP models. Moreover, our experiments underscore Human Query's adaptability to new HCP tasks, thus demonstrating its robust generalization capability. Codes and data are available at https://github.com/lishuhuai527/COCO-UniHuman.
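The abstract describes one shared per-person query representation that drives several task heads (detection, segmentation, pose, attributes). As a rough illustration of that idea only, here is a minimal numpy sketch of linear task heads decoding a shared query vector; the head names, dimensions, and decoding choices are hypothetical and are not HQNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_heads(dim, num_kpts, num_attrs):
    """Hypothetical linear task heads, all reading the same query vector."""
    return {
        "box": rng.normal(scale=0.1, size=(dim, 4)),            # box regression
        "kpt": rng.normal(scale=0.1, size=(dim, num_kpts * 2)), # keypoint offsets
        "attr": rng.normal(scale=0.1, size=(dim, num_attrs)),   # attribute logits
        "mask": rng.normal(scale=0.1, size=(dim, dim)),         # mask embedding
    }

def decode(queries, pixel_feats, heads):
    """queries: (N, dim), one per person; pixel_feats: (dim, H, W) shared map."""
    boxes = 1.0 / (1.0 + np.exp(-(queries @ heads["box"])))  # normalized cx,cy,w,h
    kpts = queries @ heads["kpt"]                            # (N, K*2)
    attrs = queries @ heads["attr"]                          # (N, A)
    # per-person mask: dot the projected query with every pixel feature
    masks = np.einsum("nd,dhw->nhw", queries @ heads["mask"], pixel_feats)
    return boxes, kpts, attrs, masks
```

The point of the sketch is that every task reads the same instance-level query, so the queries must carry enough information to disentangle individual persons.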
Related papers
- CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks.
CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels.
A Mixture-of-Experts (MoE) module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z) - HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes [72.26829188852139]
HumanPCR is an evaluation suite for probing MLLMs' capacity to understand human-related visual contexts.
Human-P, Human-C, and Human-R feature over 6,000 human-verified multiple-choice questions.
Human-R offers a challenging, manually curated video reasoning test.
arXiv Detail & Related papers (2025-08-19T09:52:04Z) - RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios [60.772871735598706]
RefHCM (Referring Human-Centric Model) is a framework to integrate a wide range of human-centric referring tasks.
RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens.
This work represents the first attempt to address referring human perceptions with a general-purpose framework.
arXiv Detail & Related papers (2024-12-19T08:51:57Z) - A Flexible Method for Behaviorally Measuring Alignment Between Human and Artificial Intelligence Using Representational Similarity Analysis [0.1957338076370071]
We adapted Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans.
We tested this approach on semantic alignment across text and image modalities, measuring how different Large Language and Vision Language Model (LLM and VLM) similarity judgments aligned with human responses at both group and individual levels.
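RSA compares two systems by correlating their pairwise (dis)similarity structures rather than their raw features. A minimal sketch of that computation, assuming cosine dissimilarity and Spearman correlation over the upper triangles of the two matrices (the paper's exact metric choices may differ):

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: pairwise 1 - cosine similarity."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 1.0 - x @ x.T

def rsa_alignment(human_feats, model_feats):
    """Spearman correlation between the upper triangles of the two RDMs."""
    h, m = rdm(human_feats), rdm(model_feats)
    iu = np.triu_indices_from(h, k=1)
    a, b = h[iu], m[iu]
    # Spearman = Pearson correlation computed on ranks
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

Because only the relational structure is compared, the human and model feature spaces can have different dimensionalities, which is what makes RSA convenient for human-AI alignment studies.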
arXiv Detail & Related papers (2024-11-30T20:24:52Z) - Object-Centric Multi-Task Learning for Human Instances [8.035105819936808]
We explore a compact multi-task network architecture that maximally shares the parameters of the multiple tasks via object-centric learning.
We propose a novel query design, called human-centric query (HCQ), to encode human instance information effectively.
Experimental results show that the proposed multi-task network achieves comparable accuracy to state-of-the-art task-specific models.
arXiv Detail & Related papers (2023-03-13T01:10:50Z) - UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP)
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines by direct evaluation.
arXiv Detail & Related papers (2023-03-06T07:10:07Z) - MDPose: Real-Time Multi-Person Pose Estimation via Mixture Density Model [27.849059115252008]
We propose a novel framework of single-stage instance-aware pose estimation by modeling the joint distribution of human keypoints.
Our MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints.
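Modeling keypoints with a mixture density model means training against the negative log-likelihood of a Gaussian mixture rather than regressing a single location. As a generic illustration of that loss (an isotropic 2-D mixture; not MDPose's exact parameterization), the NLL of one observed point can be sketched as:

```python
import math
import numpy as np

def mdn_nll(pi, mu, sigma, y):
    """Negative log-likelihood of point y under an isotropic 2-D Gaussian
    mixture. pi: (M,) weights summing to 1, mu: (M, 2) means,
    sigma: (M,) standard deviations, y: (2,) observed keypoint."""
    d2 = ((y - mu) ** 2).sum(axis=1)          # squared distance to each mean
    log_comp = (np.log(pi)
                - np.log(2 * math.pi * sigma ** 2)
                - d2 / (2 * sigma ** 2))      # per-component log density
    # log-sum-exp over components for numerical stability
    m = log_comp.max()
    return float(-(m + np.log(np.exp(log_comp - m).sum())))
```

Summing this loss over all keypoints of all instances is the standard mixture-density training recipe; the multimodal density is what lets a single-stage model keep multiple person hypotheses alive per location.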
arXiv Detail & Related papers (2023-02-17T08:29:33Z) - Body Segmentation Using Multi-task Learning [1.0832844764942349]
We present a novel multi-task model for human segmentation/parsing that involves three tasks.
The main idea behind the proposed Segmentation--Pose--DensePose model (or SPD for short) is to learn a better segmentation model by sharing knowledge across different, yet related tasks.
The performance of the model is analysed through rigorous experiments on the LIP and ATR datasets and in comparison to a recent (state-of-the-art) multi-task body-segmentation model.
arXiv Detail & Related papers (2022-12-13T13:06:21Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z) - Distribution Matching for Heterogeneous Multi-Task Learning: a Large-scale Face Study [75.42182503265056]
Multi-Task Learning has emerged as a methodology in which multiple tasks are jointly learned by a shared learning algorithm.
We deal with heterogeneous MTL, simultaneously addressing detection, classification & regression problems.
We build FaceBehaviorNet, the first framework for large-scale face analysis, by jointly learning all facial behavior tasks.
arXiv Detail & Related papers (2021-05-08T22:26:52Z) - End-to-end One-shot Human Parsing [91.5113227694443]
One-shot human parsing (OSHP) task requires parsing humans into an open set of classes defined by any test example.
An End-to-end One-shot human Parsing Network (EOP-Net) is proposed.
EOP-Net outperforms representative one-shot segmentation models by large margins.
arXiv Detail & Related papers (2021-05-04T01:35:50Z) - Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing [131.97475877877608]
A new bottom-up regime is proposed to learn category-level human semantic segmentation and multi-person pose estimation in a joint and end-to-end manner.
It is a compact, efficient and powerful framework that exploits structural information over different human granularities.
Experiments on three instance-aware human datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
arXiv Detail & Related papers (2021-03-08T06:55:00Z) - Model-agnostic Fits for Understanding Information Seeking Patterns in Humans [0.0]
In decision making tasks under uncertainty, humans display characteristic biases in seeking, integrating, and acting upon information relevant to the task.
Here, we reexamine data from previous carefully designed experiments, collected at scale, that measured and catalogued these biases in aggregate form.
We design deep learning models that replicate these biases in aggregate, while also capturing individual variation in behavior.
arXiv Detail & Related papers (2020-12-09T04:34:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.