VISUALCENT: Visual Human Analysis using Dynamic Centroid Representation
- URL: http://arxiv.org/abs/2504.19032v1
- Date: Sat, 26 Apr 2025 21:58:56 GMT
- Title: VISUALCENT: Visual Human Analysis using Dynamic Centroid Representation
- Authors: Niaz Ahmad, Youngmoon Lee, Guanghui Wang
- Abstract summary: We introduce VISUALCENT, a unified human pose and instance segmentation framework that addresses generalizability and scalability limitations in multi-person visual human analysis. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid called MaskCentroid, which swiftly clusters pixels to a specific human instance during rapid changes in body movement or in significantly occluded environments. Experimental results on the COCO and OCHuman datasets demonstrate VISUALCENT's accuracy and real-time performance advantages, outperforming existing methods in mAP scores and execution frames per second.
- Score: 8.486534291290559
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce VISUALCENT, a unified human pose and instance segmentation framework that addresses generalizability and scalability limitations in multi-person visual human analysis. VISUALCENT leverages a centroid-based bottom-up keypoint detection paradigm and uses keypoint heatmaps incorporating Disk Representation and KeyCentroid to identify the optimal keypoint coordinates. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid called MaskCentroid, which swiftly clusters pixels to a specific human instance during rapid changes in human body movement or in significantly occluded environments. Experimental results on the COCO and OCHuman datasets demonstrate VISUALCENT's accuracy and real-time performance advantages, outperforming existing methods in mAP scores and execution frames per second. The implementation is available on the project page.
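The core grouping idea behind a dynamic centroid such as MaskCentroid, clustering foreground pixels to the instance whose centroid they lie nearest, can be sketched as a nearest-centroid assignment. This is a minimal illustration of the concept, not the paper's actual method: the function name and the assumption that centroids are given as 2D points are ours.

```python
import numpy as np

def assign_pixels_to_centroids(pixel_coords, centroids):
    """Assign each foreground pixel to its nearest instance centroid.

    pixel_coords: (N, 2) array of (y, x) foreground pixel positions.
    centroids:    (K, 2) array of predicted instance centroids
                  (a stand-in for the paper's MaskCentroids).
    Returns an (N,) array of instance indices in [0, K).
    """
    # Pairwise squared distances between pixels and centroids: (N, K)
    d2 = ((pixel_coords[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Two toy instances with centroids at (10, 10) and (40, 40)
centroids = np.array([[10.0, 10.0], [40.0, 40.0]])
pixels = np.array([[9.0, 11.0], [12.0, 8.0], [39.0, 41.0]])
labels = assign_pixels_to_centroids(pixels, centroids)
print(labels)  # [0 0 1]
```

Because the centroid is a predicted keypoint rather than a fixed box anchor, it can move with the body, which is what lets this kind of grouping stay stable under motion and occlusion.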
Related papers
- You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping [119.41166438439313]
YOEO is a single-stage method that outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz.
arXiv Detail & Related papers (2025-06-06T03:49:20Z)
- Keypoints as Dynamic Centroids for Unified Human Pose and Segmentation [19.109607441709418]
Keypoints as Dynamic Centroid (KDC) is a new centroid-based representation for unified human pose estimation and instance-level segmentation. KDC adopts a bottom-up paradigm to generate keypoint heatmaps for both easily distinguishable and complex keypoints. It exploits high-confidence keypoints as dynamic centroids in the embedding space to generate MaskCentroids.
arXiv Detail & Related papers (2025-05-17T20:05:34Z)
- Robust Human Registration with Body Part Segmentation on Noisy Point Clouds [73.00876572870787]
We introduce a hybrid approach that incorporates body-part segmentation into the mesh fitting process. Our method first assigns body part labels to individual points, which then guide a two-step SMPL-X fitting. We demonstrate that the fitted human mesh can refine body part labels, leading to improved segmentation.
arXiv Detail & Related papers (2025-04-04T17:17:33Z)
- Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
Open-Vocabulary Keypoint Detection (OVKD) task is innovatively designed to use text prompts for identifying arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM)
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
arXiv Detail & Related papers (2023-10-08T07:42:41Z)
- HOKEM: Human and Object Keypoint-based Extension Module for Human-Object Interaction Detection [1.2183405753834557]
This paper presents the human and object keypoint-based extension module (HOKEM) as an easy-to-use extension module to improve the accuracy of the conventional detection models.
Experiments using the HOI dataset, V-COCO, showed that HOKEM boosted the accuracy of an appearance-based model by a large margin.
arXiv Detail & Related papers (2023-06-25T14:40:26Z)
- MDPose: Real-Time Multi-Person Pose Estimation via Mixture Density Model [27.849059115252008]
We propose a novel framework of single-stage instance-aware pose estimation by modeling the joint distribution of human keypoints.
Our MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints.
arXiv Detail & Related papers (2023-02-17T08:29:33Z)
- 2D Human Pose Estimation with Explicit Anatomical Keypoints Structure Constraints [15.124606575017621]
We present a novel 2D human pose estimation method with explicit anatomical keypoints structure constraints.
Our proposed model can be plugged into most existing bottom-up or top-down human pose estimation methods.
Our method performs favorably against most existing bottom-up and top-down human pose estimation methods.
arXiv Detail & Related papers (2022-12-05T11:01:43Z)
- Spatiotemporal k-means [39.98633724527769]
We propose a spatiotemporal clustering method called spatiotemporal k-means (STkM) that is able to analyze multi-scale clusters.
We show how STkM can be extended to more complex machine learning tasks, particularly unsupervised region of interest detection and tracking in videos.
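The underlying idea, clustering spatial points whose features include a weighted time axis, can be sketched as ordinary Lloyd-style k-means over (x, y, t) features. This is a generic illustration, not the STkM algorithm itself; the `time_weight` parameter and the naive first-k initialization are assumptions of the sketch.

```python
import numpy as np

def spatiotemporal_kmeans(points, k, time_weight=1.0, iters=20):
    """Toy Lloyd-style k-means over (x, y, t) features.

    points:      (N, 3) array of (x, y, t) observations.
    time_weight: scale factor applied to the t axis before clustering,
                 trading off spatial vs. temporal proximity.
    Returns (labels, centers). The paper's STkM additionally handles
    multi-scale cluster structure, which this sketch omits.
    """
    feats = points.astype(float).copy()
    feats[:, 2] *= time_weight
    centers = feats[:k].copy()  # naive init: first k points
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute means.
        d2 = ((feats[:, None] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(axis=0)
    return labels, centers

# Two point groups separated in both space and time, interleaved in input.
pts = np.array([[0, 0, 0], [100, 100, 10], [1, 0, 0],
                [99, 100, 10], [0, 1, 1], [100, 99, 9]])
labels, _ = spatiotemporal_kmeans(pts, k=2)
print(labels)  # points 0, 2, 4 share one label; points 1, 3, 5 the other
```

Raising `time_weight` makes the clustering prefer grouping points that are close in time, which is the basic lever for tracking a region of interest across video frames.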
arXiv Detail & Related papers (2022-11-10T04:40:31Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Group-Skeleton-Based Human Action Recognition in Complex Events [15.649778891665468]
We propose a novel group-skeleton-based human action recognition method in complex events.
This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons.
Results on the HiEve dataset show that our method can give superior performance compared to other state-of-the-art methods.
arXiv Detail & Related papers (2020-11-26T13:19:14Z) - GID-Net: Detecting Human-Object Interaction with Global and Instance
Dependency [67.95192190179975]
We introduce a two-stage trainable reasoning mechanism, referred to as GID block.
GID-Net is a human-object interaction detection framework consisting of a human branch, an object branch and an interaction branch.
We have compared our proposed GID-Net with existing state-of-the-art methods on two public benchmarks, including V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-11T11:58:43Z) - Towards High Performance Human Keypoint Detection [87.1034745775229]
We find that context information plays an important role in reasoning human body configuration and invisible keypoints.
Inspired by this, we propose a cascaded context mixer (CCM) which efficiently integrates spatial and channel context information.
To maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy.
We present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy.
arXiv Detail & Related papers (2020-02-03T02:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.