PromptHMR: Promptable Human Mesh Recovery
- URL: http://arxiv.org/abs/2504.06397v1
- Date: Tue, 08 Apr 2025 19:38:04 GMT
- Title: PromptHMR: Promptable Human Mesh Recovery
- Authors: Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, Muhammed Kocabas
- Abstract summary: Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. We present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities.
- Score: 68.65788167859817
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context, while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.
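To make the prompt-driven formulation concrete, the sketch below shows how a promptable HPS model could fuse full-image tokens with spatial prompts (bounding boxes) and semantic prompts (text embeddings) before regressing SMPL-style parameters. The module layout, dimensions, and heads are illustrative assumptions, not the released PromptHMR architecture.

```python
# Minimal sketch of a promptable HPS model in the spirit of PromptHMR.
# All module names, dimensions, and the token layout are illustrative
# assumptions, not the authors' released architecture.
import torch
import torch.nn as nn


class PromptableHPS(nn.Module):
    def __init__(self, dim=256, num_patches=196, text_dim=512):
        super().__init__()
        # Placeholder full-image backbone: here we just project pre-extracted
        # patch features so the whole scene stays visible to the model.
        self.patch_proj = nn.Linear(768, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Spatial prompt: a bounding box (x1, y1, x2, y2) embedded as one token.
        self.box_embed = nn.Linear(4, dim)
        # Semantic prompt: e.g. a CLIP-style sentence embedding describing shape.
        self.text_embed = nn.Linear(text_dim, dim)
        # Prompt tokens cross-attend to image tokens.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # SMPL-style heads: 24 joint rotations (6D) + 10 shape betas + camera.
        self.pose_head = nn.Linear(dim, 24 * 6)
        self.shape_head = nn.Linear(dim, 10)
        self.cam_head = nn.Linear(dim, 3)

    def forward(self, patch_feats, boxes, text_feats=None):
        # patch_feats: (B, num_patches, 768); boxes: (B, P, 4) for P people.
        img_tokens = self.patch_proj(patch_feats) + self.pos_embed
        prompt = self.box_embed(boxes)                     # (B, P, dim)
        if text_feats is not None:                         # optional semantic prompt
            prompt = prompt + self.text_embed(text_feats)  # fuse per-person text
        out = self.decoder(prompt, img_tokens)             # (B, P, dim)
        return self.pose_head(out), self.shape_head(out), self.cam_head(out)


if __name__ == "__main__":
    model = PromptableHPS()
    feats = torch.randn(1, 196, 768)           # full-image patch features
    boxes = torch.rand(1, 2, 4)                # two people prompted by boxes
    text = torch.randn(1, 2, 512)              # e.g. "a tall, broad-shouldered man"
    pose, shape, cam = model(feats, boxes, text)
    print(pose.shape, shape.shape, cam.shape)  # (1, 2, 144) (1, 2, 10) (1, 2, 3)
```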
Related papers
- Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation [18.73832646369506]
We propose a novel cross-attention mechanism to encode the scene context for affordance prediction in 2D scenes. First, we sample a probable location for a person within the scene using a variational autoencoder conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding.
arXiv Detail & Related papers (2025-02-19T11:24:45Z)
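A minimal sketch of the two-stage pipeline summarized above: a scene-conditioned variational autoencoder samples a plausible person location, then a classifier picks a pose template from the local context. Module names and dimensions are illustrative assumptions, not the authors' code.

```python
# Illustrative two-stage affordance sketch: a scene-conditioned VAE samples a
# plausible person location, then a classifier picks a pose template from the
# local context. Dimensions and module names are assumptions for clarity.
import torch
import torch.nn as nn


class LocationVAE(nn.Module):
    def __init__(self, scene_dim=512, latent_dim=32):
        super().__init__()
        # Training-time posterior encoder (unused in this inference-only sketch).
        self.enc = nn.Sequential(nn.Linear(2 + scene_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim + scene_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2))  # (x, y) in normalized coords
        self.latent_dim = latent_dim

    def sample(self, scene_ctx):
        # At test time, draw z from the prior and decode a location
        # conditioned on the global scene context encoding.
        z = torch.randn(scene_ctx.size(0), self.latent_dim)
        return torch.sigmoid(self.dec(torch.cat([z, scene_ctx], dim=-1)))


class PoseTemplateClassifier(nn.Module):
    def __init__(self, local_dim=512, num_templates=30):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(local_dim, 256), nn.ReLU(),
                                  nn.Linear(256, num_templates))

    def forward(self, local_ctx):
        return self.head(local_ctx).softmax(dim=-1)


if __name__ == "__main__":
    global_ctx = torch.randn(1, 512)   # global scene context encoding
    local_ctx = torch.randn(1, 512)    # local context around the sampled location
    loc = LocationVAE().sample(global_ctx)
    template_probs = PoseTemplateClassifier()(local_ctx)
    print(loc, template_probs.argmax(dim=-1))
```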
- DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation [0.13124513975412253]
We present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models.
Our approach begins by translating images into detailed textual descriptions using a captioning model.
These descriptions are then used to produce new test images through a text-to-image diffusion process.
arXiv Detail & Related papers (2025-02-05T16:35:42Z)
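A minimal sketch of the caption-then-regenerate augmentation loop summarized above, assuming off-the-shelf BLIP captioning and Stable Diffusion; the paper additionally uses control-conditioned diffusion, which this sketch omits.

```python
# Sketch of a caption -> text-to-image augmentation loop in the spirit of the
# entry above. Model choices (BLIP, Stable Diffusion v1.5) are assumptions;
# the control-conditioned diffusion used in the paper is omitted here.
import torch
from PIL import Image
from transformers import pipeline
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
generator = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)


def augment(image_path: str, edit: str = "at night, in heavy rain") -> Image.Image:
    """Describe the source image, perturb the description, and synthesize a new test image."""
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"{caption}, {edit}"          # counterfactual condition for testing
    return generator(prompt).images[0]


if __name__ == "__main__":
    augment("test_image.jpg").save("augmented_test_image.jpg")
```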
- Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding [1.0445560141983634]
We propose a novel image-based semantic embedding that extracts contextual information directly from visual features. Our method achieves performance comparable to state-of-the-art models while addressing the shortcomings of CLIP embeddings in handling outdoor scenes.
arXiv Detail & Related papers (2025-02-01T15:37:22Z)
- Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which injects semantic information into the visual prompt to distill a semantic-enhanced prompt for visual representation enrichment. AENet comprises two key steps: 1) exploring concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding the semantic-enhanced prompt via a visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z)
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- GaFET: Learning Geometry-aware Facial Expression Translation from In-The-Wild Images [55.431697263581626]
We introduce a novel Geometry-aware Facial Expression Translation framework, which is based on parametric 3D facial representations and can stably decouple expression.
We achieve higher-quality and more accurate facial expression transfer results compared to state-of-the-art methods, and demonstrate applicability to various poses and complex textures.
arXiv Detail & Related papers (2023-08-07T09:03:35Z)
- LAMP: Leveraging Language Prompts for Multi-person Pose Estimation [8.983326069321981]
We propose a novel prompt-based pose inference strategy called LAMP (Language Assisted Multi-person Pose estimation).
By utilizing the text representations generated by a well-trained language model (CLIP), LAMP can facilitate the understanding of poses on the instance and joint levels.
This paper demonstrates that language-supervised training boosts the performance of single-stage multi-person pose estimation.
arXiv Detail & Related papers (2023-07-21T23:00:43Z)
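A minimal sketch of using frozen CLIP text representations as joint-level language prompts, in the spirit of the entry above; the prompt phrasing and the similarity-based supervision are assumptions, not the authors' implementation.

```python
# Sketch of using CLIP text representations as language prompts for pose
# estimation, in the spirit of the entry above. The joint-prompt phrasing and
# the way similarities would supervise a pose head are assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

JOINTS = ["nose", "left shoulder", "right shoulder", "left elbow", "right elbow",
          "left wrist", "right wrist", "left hip", "right hip", "left knee",
          "right knee", "left ankle", "right ankle"]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# One language prompt per joint; frozen CLIP turns them into joint-level queries.
prompts = [f"a photo of a person's {j}" for j in JOINTS]
with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    joint_text_feats = text_encoder(**tokens).text_embeds        # (13, 512)
    joint_text_feats = joint_text_feats / joint_text_feats.norm(dim=-1, keepdim=True)

# Hypothetical per-joint visual features from a pose network (batch of 2 people).
visual_feats = torch.randn(2, len(JOINTS), 512)
visual_feats = visual_feats / visual_feats.norm(dim=-1, keepdim=True)

# Joint-level text-visual similarity that a language-supervised loss could maximize.
similarity = torch.einsum("bjd,jd->bj", visual_feats, joint_text_feats)
print(similarity.shape)  # (2, 13)
```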
- MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating the few-shot text classification task as a text-pair relevance estimation task.
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
arXiv Detail & Related papers (2023-06-15T06:51:35Z)
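A minimal sketch of the text-pair relevance reformulation summarized above; the original work scores pairs with a prompted masked language model, whereas this sketch swaps in an off-the-shelf cross-encoder to illustrate the idea.

```python
# Sketch of the pair-relevance reformulation of few-shot text classification in
# the spirit of the entry above. The original work scores pairs with a prompted
# masked language model; this sketch swaps in an off-the-shelf cross-encoder.
from collections import defaultdict
from sentence_transformers import CrossEncoder

# Tiny few-shot support set: (text, label) pairs.
support = [
    ("The team won the championship last night", "sports"),
    ("The striker scored twice in the final", "sports"),
    ("The central bank raised interest rates", "finance"),
    ("Shares fell after the earnings report", "finance"),
]

scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def classify(query: str) -> str:
    # Score the query against every support example and aggregate per class.
    pairs = [(query, text) for text, _ in support]
    scores = scorer.predict(pairs)
    per_class = defaultdict(float)
    for (_, label), score in zip(support, scores):
        per_class[label] += float(score)
    return max(per_class, key=per_class.get)


if __name__ == "__main__":
    print(classify("The index dropped two percent on inflation fears"))  # finance
```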
- Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z)
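A minimal sketch of treating prompt tokens as latent variables with a variational objective, in the spirit of the entry above; the token count, the standard-normal prior, and the downstream loss are illustrative assumptions.

```python
# Sketch of prompt learning as variational inference in the spirit of the entry
# above: learnable context tokens are modeled as Gaussians, sampled with the
# reparameterization trick, and regularized toward a standard-normal prior.
# Token count, dimensions, and the downstream loss are illustrative assumptions.
import torch
import torch.nn as nn


class BayesianPrompt(nn.Module):
    def __init__(self, num_tokens=8, dim=512):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_tokens, dim))
        self.logvar = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self):
        # Reparameterized sample of the prompt tokens.
        std = torch.exp(0.5 * self.logvar)
        prompt = self.mu + std * torch.randn_like(std)
        # KL divergence to a standard-normal prior regularizes the prompt space.
        kl = 0.5 * (self.mu.pow(2) + self.logvar.exp() - self.logvar - 1).sum()
        return prompt, kl


if __name__ == "__main__":
    prompt_dist = BayesianPrompt()
    prompt, kl = prompt_dist()
    # In practice the sampled prompt would prefix class-name tokens fed to a
    # frozen text encoder; here we just combine a placeholder task loss with KL.
    task_loss = prompt.pow(2).mean()          # stand-in for the downstream loss
    loss = task_loss + 1e-3 * kl
    loss.backward()
    print(prompt.shape, float(kl))
```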