Related papers: Referring Human Pose and Mask Estimation in the Wild

URL: http://arxiv.org/abs/2410.20508v1
Date: Sun, 27 Oct 2024 16:44:15 GMT
Title: Referring Human Pose and Mask Estimation in the Wild
Authors: Bo Miao, Mingtao Feng, Zijie Wu, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian
Abstract summary: We introduce Referring Human Pose and Mask Estimation (R-HPM) in the wild. This task holds significant potential for human-centric applications such as assistive robotics and sports analysis. We propose the first end-to-end promptable approach named UniPHD for R-HPM.
Score: 57.12038065541915
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Referring Human Pose and Mask Estimation (R-HPM) in the wild, where either a text or positional prompt specifies the person of interest in an image. This new task holds significant potential for human-centric applications such as assistive robotics and sports analysis. In contrast to previous works, R-HPM (i) ensures high-quality, identity-aware results corresponding to the referred person, and (ii) simultaneously predicts human pose and mask for a comprehensive representation. To achieve this, we introduce a large-scale dataset named RefHuman, which substantially extends the MS COCO dataset with additional text and positional prompt annotations. RefHuman includes over 50,000 annotated instances in the wild, each equipped with keypoint, mask, and prompt annotations. To enable prompt-conditioned estimation, we propose the first end-to-end promptable approach named UniPHD for R-HPM. UniPHD extracts multimodal representations and employs a proposed pose-centric hierarchical decoder to process (text or positional) instance queries and keypoint queries, producing results specific to the referred person. Extensive experiments demonstrate that UniPHD produces quality results based on user-friendly prompts and achieves top-tier performance on RefHuman val and MS COCO val2017. Data and Code: https://github.com/bo-miao/RefHuman
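
To make the task setup in the abstract concrete, the following is a minimal Python sketch of what one referred-person record could look like: keypoints, a mask, and a prompt. It is not taken from the RefHuman repository. The class name ReferredInstance, its field names, the list-of-points encoding of the positional prompt, and the rule that each record carries exactly one prompt are all assumptions made for illustration. The only fixed fact is that MS COCO annotates 17 keypoints per person.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_COCO_KEYPOINTS = 17  # MS COCO annotates 17 keypoints per person

Point = Tuple[float, float]


@dataclass
class ReferredInstance:
    """Hypothetical record for one referred person (illustrative names only)."""

    keypoints: List[Tuple[float, float, int]]  # (x, y, visibility) per keypoint
    mask: List[List[int]]  # binary mask as rows of 0/1
    text_prompt: Optional[str] = None  # e.g. a referring expression
    positional_prompt: Optional[List[Point]] = None  # assumed: list of (x, y) points

    def prompt_kind(self) -> str:
        """Return which prompt type refers to this person."""
        has_text = self.text_prompt is not None
        has_pos = self.positional_prompt is not None
        # Assumption: a record carries exactly one prompt, text or positional.
        if has_text == has_pos:
            raise ValueError("expected exactly one of text or positional prompt")
        return "text" if has_text else "positional"

    def validate(self) -> None:
        """Check the keypoint count and the prompt assumption."""
        if len(self.keypoints) != NUM_COCO_KEYPOINTS:
            raise ValueError("expected 17 COCO keypoints")
        self.prompt_kind()


# Usage: a text-prompted instance with placeholder keypoints and a tiny mask.
example = ReferredInstance(
    keypoints=[(0.0, 0.0, 2)] * NUM_COCO_KEYPOINTS,
    mask=[[0, 1], [1, 1]],
    text_prompt="the person holding an umbrella",
)
example.validate()
```

The actual annotation format is defined by the released data at the repository linked above and may differ from this sketch.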

Related papers

Human Body Restoration with One-Step Diffusion Model and A New Benchmark [74.66514054623669]
We propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. We also propose OSDHuman, a novel one-step diffusion model for human body restoration.
arXiv Detail & Related papers (2025-02-03T14:48:40Z)
RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios [60.772871735598706]
RefHCM (Referring Human-Centric Model) is a framework to integrate a wide range of human-centric referring tasks. RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens. This work represents the first attempt to address referring human perceptions with a general-purpose framework.
arXiv Detail & Related papers (2024-12-19T08:51:57Z)
Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. We present two new SOD datasets, "DUTS-MM" and "DUTS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z)
SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [63.946809247201905]
We introduce a new benchmark, namely SHIELD, to evaluate the ability of MLLMs on face spoofing and forgery detection. We design true/false and multiple-choice questions to evaluate multimodal face data in these two face security tasks. The results indicate that MLLMs hold substantial potential in the face security domain.
arXiv Detail & Related papers (2024-02-06T17:31:36Z)
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception [37.667147915777534]
Human-centric perception is a long-standing problem for computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Human Query captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios.
arXiv Detail & Related papers (2023-12-09T10:36:43Z)
HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception [97.55089867970874]
We introduce masked image modeling (MIM) as a pre-training approach for human-centric perception. Motivated by this insight, we incorporate an intuitive human structure prior - human parts - into pre-training. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks.
arXiv Detail & Related papers (2023-10-31T17:56:11Z)
AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method. With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed as AdaptivePose. We employ AdaptivePose for both 2D/3D multi-person pose estimation tasks to verify the effectiveness of AdaptivePose.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address occlusion by employing body clues provided by an extra network to distinguish the visible part. We propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge. Under this condition, the occluded representation could be well aligned in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z)
Human De-occlusion: Invisible Perception and Recovery for Humans [26.404444296924243]
We tackle the problem of human de-occlusion, which reasons about occluded segmentation masks and invisible appearance content of humans. In particular, a two-stage framework is proposed to estimate the invisible portions and recover the content inside. Our method outperforms state-of-the-art techniques in both mask completion and content recovery.
arXiv Detail & Related papers (2021-03-22T05:54:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.