LAMP: Leveraging Language Prompts for Multi-person Pose Estimation
- URL: http://arxiv.org/abs/2307.11934v2
- Date: Wed, 26 Jul 2023 18:08:10 GMT
- Title: LAMP: Leveraging Language Prompts for Multi-person Pose Estimation
- Authors: Shengnan Hu, Ce Zheng, Zixiang Zhou, Chen Chen, and Gita Sukthankar
- Abstract summary: We propose a novel prompt-based pose inference strategy called LAMP (Language Assisted Multi-person Pose estimation).
By utilizing the text representations generated by a well-trained language model (CLIP), LAMP can facilitate the understanding of poses on the instance and joint levels.
This paper demonstrates that language-supervised training boosts the performance of single-stage multi-person pose estimation.
- Score: 8.983326069321981
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-centric visual understanding is an important desideratum for effective
human-robot interaction. In order to navigate crowded public places, social
robots must be able to interpret the activity of the surrounding humans. This
paper addresses one key aspect of human-centric visual understanding,
multi-person pose estimation. Achieving good performance on multi-person pose
estimation in crowded scenes is difficult due to the challenges of occluded
joints and instance separation. In order to tackle these challenges and
overcome the limitations of image features in representing invisible body
parts, we propose a novel prompt-based pose inference strategy called LAMP
(Language Assisted Multi-person Pose estimation). By utilizing the text
representations generated by a well-trained language model (CLIP), LAMP can
facilitate the understanding of poses on the instance and joint levels, and
learn more robust visual representations that are less susceptible to
occlusion. This paper demonstrates that language-supervised training boosts the
performance of single-stage multi-person pose estimation, and both
instance-level and joint-level prompts are valuable for training. The code is
available at https://github.com/shengnanh20/LAMP.
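As a rough illustration of the prompt-based idea summarized above, the sketch below encodes instance-level and joint-level text prompts with a pretrained CLIP text encoder and scores them against per-joint visual features. This is a minimal sketch under assumed names: the prompt templates, the joint list, and the `pose_features` tensor are illustrative placeholders, not the released LAMP implementation (see the linked repository for the authors' code).

```python
# Minimal sketch (not the authors' code): encode joint-level and
# instance-level text prompts with CLIP and compare them with per-joint
# visual features, as one might do in language-assisted pose estimation.
# Prompt templates, joint names, and `pose_features` are placeholders.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical COCO-style keypoint names used to build joint-level prompts.
joint_names = ["nose", "left shoulder", "right shoulder", "left elbow",
               "right elbow", "left wrist", "right wrist"]
joint_prompts = [f"a photo of a person's {name}" for name in joint_names]
instance_prompt = ["a photo of a person"]

with torch.no_grad():
    joint_text = model.encode_text(clip.tokenize(joint_prompts).to(device))
    inst_text = model.encode_text(clip.tokenize(instance_prompt).to(device))

# Normalize so that dot products become cosine similarities.
joint_text = joint_text / joint_text.norm(dim=-1, keepdim=True)
inst_text = inst_text / inst_text.norm(dim=-1, keepdim=True)

# `pose_features` stands in for per-joint visual embeddings from a pose
# network, projected to CLIP's text-embedding dimension (512 for ViT-B/32).
pose_features = torch.randn(len(joint_names), 512, device=device)
pose_features = pose_features / pose_features.norm(dim=-1, keepdim=True)

# Joint-level alignment: entry (i, j) scores how well the visual feature of
# joint i matches the text prompt of joint j; such scores could serve as an
# auxiliary (e.g. contrastive) training signal for occluded joints.
joint_alignment = pose_features.float() @ joint_text.float().t()
print(joint_alignment.shape)  # (num_joints, num_joints)

# Instance-level alignment: pooled pose features against the person prompt.
instance_score = pose_features.float().mean(dim=0, keepdim=True) @ inst_text.float().t()
print(instance_score.item())
```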
Related papers
- Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z) - Learning Transferable Pedestrian Representation from Multimodal
Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z) - Weakly-Supervised HOI Detection from Interaction Labels Only and
Language/Vision-Language Priors [36.75629570208193]
Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image.
In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels.
We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model.
Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis …
arXiv Detail & Related papers (2023-03-09T19:08:02Z)
- Word level Bangla Sign Language Dataset for Continuous BSL Recognition [0.0]
We develop an attention-based Bi-GRU model that captures the temporal dynamics of pose information for individuals communicating through sign language.
The accuracy of the model is reported to be 85.64%.
arXiv Detail & Related papers (2023-02-22T18:55:54Z) - Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - CLAMP: Prompt-based Contrastive Learning for Connecting Language and
Animal Pose [70.59906971581192]
We introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose effectively.
The CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training.
Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings.
arXiv Detail & Related papers (2022-06-23T14:51:42Z) - Differentiable Multi-Granularity Human Representation Learning for
Instance-Aware Human Semantic Parsing [131.97475877877608]
A new bottom-up regime is proposed to learn category-level human semantic segmentation and multi-person pose estimation in a joint and end-to-end manner.
It is a compact, efficient and powerful framework that exploits structural information over different human granularities.
Experiments on three instance-aware human datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
arXiv Detail & Related papers (2021-03-08T06:55:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The quality of the information is not guaranteed, and this site is not responsible for any consequences of its use.