You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
- URL: http://arxiv.org/abs/2312.05525v3
- Date: Sun, 14 Jul 2024 09:33:09 GMT
- Title: You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
- Authors: Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo
- Abstract summary: Human-centric perception is a long-standing problem for computer vision.
This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP).
Human Query captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios.
- Score: 37.667147915777534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-centric perception (e.g. detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem for computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Our approach centers on learning a unified human query representation, denoted as Human Query, which captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios. Although different HCP tasks have been well-studied individually, single-stage multi-task learning of HCP tasks has not been fully exploited in the literature due to the absence of a comprehensive benchmark dataset. To address this gap, we propose COCO-UniHuman benchmark to enable model development and comprehensive evaluation. Experimental results demonstrate the proposed method's state-of-the-art performance among multi-task HCP models and its competitive performance compared to task-specific HCP models. Moreover, our experiments underscore Human Query's adaptability to new HCP tasks, thus demonstrating its robust generalization capability. Codes and data are available at https://github.com/lishuhuai527/COCO-UniHuman.
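The abstract describes one shared per-person query representation that drives several task heads (detection, segmentation, pose, attributes). As a rough illustration of that idea only, here is a minimal numpy sketch of linear task heads decoding a shared query vector; the head names, dimensions, and decoding choices are hypothetical and are not HQNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_heads(dim, num_kpts, num_attrs):
    """Hypothetical linear task heads, all reading the same query vector."""
    return {
        "box": rng.normal(scale=0.1, size=(dim, 4)),            # box regression
        "kpt": rng.normal(scale=0.1, size=(dim, num_kpts * 2)), # keypoint offsets
        "attr": rng.normal(scale=0.1, size=(dim, num_attrs)),   # attribute logits
        "mask": rng.normal(scale=0.1, size=(dim, dim)),         # mask embedding
    }

def decode(queries, pixel_feats, heads):
    """queries: (N, dim), one per person; pixel_feats: (dim, H, W) shared map."""
    boxes = 1.0 / (1.0 + np.exp(-(queries @ heads["box"])))  # normalized cx,cy,w,h
    kpts = queries @ heads["kpt"]                            # (N, K*2)
    attrs = queries @ heads["attr"]                          # (N, A)
    # per-person mask: dot the projected query with every pixel feature
    masks = np.einsum("nd,dhw->nhw", queries @ heads["mask"], pixel_feats)
    return boxes, kpts, attrs, masks
```

The point of the sketch is that every task reads the same instance-level query, so the queries must carry enough information to disentangle individual persons.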
Related papers
- CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks.
CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels.
A Mixture-of-Experts (MoE) module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z) - HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes [72.26829188852139]
HumanPCR is an evaluation suite for probing MLLMs' capacity to understand human-related visual contexts.
Human-P, Human-C, and Human-R feature over 6,000 human-verified multiple-choice questions.
Human-R offers a challenging, manually curated video reasoning test.
arXiv Detail & Related papers (2025-08-19T09:52:04Z) - RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios [60.772871735598706]
RefHCM (Referring Human-Centric Model) is a framework to integrate a wide range of human-centric referring tasks.
RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens.
This work represents the first attempt to address referring human perceptions with a general-purpose framework.
arXiv Detail & Related papers (2024-12-19T08:51:57Z) - A Flexible Method for Behaviorally Measuring Alignment Between Human and Artificial Intelligence Using Representational Similarity Analysis [0.1957338076370071]
We adapted Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans.
We tested this approach on semantic alignment across text and image modalities, measuring how different Large Language and Vision Language Model (LLM and VLM) similarity judgments aligned with human responses at both group and individual levels.
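RSA compares two systems by correlating their pairwise (dis)similarity structures rather than their raw features. A minimal sketch of that computation, assuming cosine dissimilarity and Spearman correlation over the upper triangles of the two matrices (the paper's exact metric choices may differ):

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: pairwise 1 - cosine similarity."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 1.0 - x @ x.T

def rsa_alignment(human_feats, model_feats):
    """Spearman correlation between the upper triangles of the two RDMs."""
    h, m = rdm(human_feats), rdm(model_feats)
    iu = np.triu_indices_from(h, k=1)
    a, b = h[iu], m[iu]
    # Spearman = Pearson correlation computed on ranks
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

Because only the relational structure is compared, the human and model feature spaces can have different dimensionalities, which is what makes RSA convenient for human-AI alignment studies.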
arXiv Detail & Related papers (2024-11-30T20:24:52Z) - Object-Centric Multi-Task Learning for Human Instances [8.035105819936808]
We explore a compact multi-task network architecture that maximally shares the parameters of the multiple tasks via object-centric learning.
We propose a novel query design, called human-centric query (HCQ), to encode human instance information effectively.
Experimental results show that the proposed multi-task network achieves comparable accuracy to state-of-the-art task-specific models.
arXiv Detail & Related papers (2023-03-13T01:10:50Z) - UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP)
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines by direct evaluation.
arXiv Detail & Related papers (2023-03-06T07:10:07Z) - MDPose: Real-Time Multi-Person Pose Estimation via Mixture Density Model [27.849059115252008]
We propose a novel framework of single-stage instance-aware pose estimation by modeling the joint distribution of human keypoints.
Our MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints.
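Modeling keypoints with a mixture density model means training against the negative log-likelihood of a Gaussian mixture rather than regressing a single location. As a generic illustration of that loss (an isotropic 2-D mixture; not MDPose's exact parameterization), the NLL of one observed point can be sketched as:

```python
import math
import numpy as np

def mdn_nll(pi, mu, sigma, y):
    """Negative log-likelihood of point y under an isotropic 2-D Gaussian
    mixture. pi: (M,) weights summing to 1, mu: (M, 2) means,
    sigma: (M,) standard deviations, y: (2,) observed keypoint."""
    d2 = ((y - mu) ** 2).sum(axis=1)          # squared distance to each mean
    log_comp = (np.log(pi)
                - np.log(2 * math.pi * sigma ** 2)
                - d2 / (2 * sigma ** 2))      # per-component log density
    # log-sum-exp over components for numerical stability
    m = log_comp.max()
    return float(-(m + np.log(np.exp(log_comp - m).sum())))
```

Summing this loss over all keypoints of all instances is the standard mixture-density training recipe; the multimodal density is what lets a single-stage model keep multiple person hypotheses alive per location.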
arXiv Detail & Related papers (2023-02-17T08:29:33Z) - Body Segmentation Using Multi-task Learning [1.0832844764942349]
We present a novel multi-task model for human segmentation/parsing that involves three tasks.
The main idea behind the proposed Segmentation--Pose--DensePose model (or SPD for short) is to learn a better segmentation model by sharing knowledge across different, yet related tasks.
The performance of the model is analysed through rigorous experiments on the LIP and ATR datasets and in comparison to a recent (state-of-the-art) multi-task body-segmentation model.
arXiv Detail & Related papers (2022-12-13T13:06:21Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z) - Distribution Matching for Heterogeneous Multi-Task Learning: a Large-scale Face Study [75.42182503265056]
Multi-Task Learning has emerged as a methodology in which multiple tasks are jointly learned by a shared learning algorithm.
We deal with heterogeneous MTL, simultaneously addressing detection, classification & regression problems.
We build FaceBehaviorNet, the first framework for large-scale face analysis, by jointly learning all facial behavior tasks.
arXiv Detail & Related papers (2021-05-08T22:26:52Z) - End-to-end One-shot Human Parsing [91.5113227694443]
One-shot human parsing (OSHP) task requires parsing humans into an open set of classes defined by any test example.
An End-to-end One-shot human Parsing Network (EOP-Net) is proposed.
EOP-Net outperforms representative one-shot segmentation models by large margins.
arXiv Detail & Related papers (2021-05-04T01:35:50Z) - Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing [131.97475877877608]
A new bottom-up regime is proposed to learn category-level human semantic segmentation and multi-person pose estimation in a joint and end-to-end manner.
It is a compact, efficient and powerful framework that exploits structural information over different human granularities.
Experiments on three instance-aware human datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
arXiv Detail & Related papers (2021-03-08T06:55:00Z) - Model-agnostic Fits for Understanding Information Seeking Patterns in Humans [0.0]
In decision making tasks under uncertainty, humans display characteristic biases in seeking, integrating, and acting upon information relevant to the task.
Here, we reexamine data from previous carefully designed experiments, collected at scale, that measured and catalogued these biases in aggregate form.
We design deep learning models that replicate these biases in aggregate, while also capturing individual variation in behavior.
arXiv Detail & Related papers (2020-12-09T04:34:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.