PoseScript: Linking 3D Human Poses and Natural Language
- URL: http://arxiv.org/abs/2210.11795v2
- Date: Fri, 19 Jan 2024 14:08:38 GMT
- Title: PoseScript: Linking 3D Human Poses and Natural Language
- Authors: Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc
Moreno-Noguer, Grégory Rogez
- Abstract summary: We introduce the PoseScript dataset, which pairs more than six thousand 3D human poses with rich human-annotated descriptions.
To increase the size of the dataset to a scale that is compatible with data-hungry learning algorithms, we have proposed an elaborate captioning process.
This process extracts low-level pose information, known as "posecodes", using a set of simple but generic rules on the 3D keypoints.
With automatic annotations, the amount of available data significantly scales up (100k), making it possible to effectively pretrain deep models for finetuning on human captions.
- Score: 33.325778872898866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language plays a critical role in many computer vision applications,
such as image captioning, visual question answering, and cross-modal retrieval,
to provide fine-grained semantic information. Unfortunately, while human pose
is key to human understanding, current 3D human pose datasets lack detailed
language descriptions. To address this issue, we have introduced the PoseScript
dataset. This dataset pairs more than six thousand 3D human poses from AMASS
with rich human-annotated descriptions of the body parts and their spatial
relationships. Additionally, to increase the size of the dataset to a scale
that is compatible with data-hungry learning algorithms, we have proposed an
elaborate captioning process that generates automatic synthetic descriptions in
natural language from given 3D keypoints. This process extracts low-level pose
information, known as "posecodes", using a set of simple but generic rules on
the 3D keypoints. These posecodes are then combined into higher level textual
descriptions using syntactic rules. With automatic annotations, the amount of
available data significantly scales up (100k), making it possible to
effectively pretrain deep models for finetuning on human captions. To showcase
the potential of annotated poses, we present three multi-modal learning tasks
that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps
3D poses and textual descriptions into a joint embedding space, allowing for
cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we
establish a baseline for a text-conditioned model generating 3D poses. Thirdly,
we present a learned process for generating pose descriptions. These
applications demonstrate the versatility and usefulness of annotated poses in
various tasks and pave the way for future research in the field.
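The automatic captioning pipeline described above has two stages: rule-based extraction of low-level "posecodes" from 3D keypoints, followed by syntactic combination into natural-language descriptions. The following is a minimal illustrative sketch of that idea, not the paper's actual implementation; the joint names, angle thresholds, rules, and templates are all assumptions chosen for the example.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by 3D keypoints a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def extract_posecodes(kp):
    """Apply simple, generic rules to 3D keypoints to obtain low-level posecodes.

    kp maps a joint name (hypothetical naming) to a 3D position with y pointing up.
    """
    codes = []
    # Rule: an elbow counts as "bent" when the shoulder-elbow-wrist angle
    # falls below an (assumed) threshold.
    for side in ("left", "right"):
        ang = joint_angle(kp[f"{side}_shoulder"], kp[f"{side}_elbow"], kp[f"{side}_wrist"])
        if ang < 120:
            codes.append((f"{side} elbow", "bent"))
    # Rule: relative height of the two wrists, with a small tolerance in metres.
    if kp["left_wrist"][1] > kp["right_wrist"][1] + 0.1:
        codes.append(("left hand", "above the right hand"))
    elif kp["right_wrist"][1] > kp["left_wrist"][1] + 0.1:
        codes.append(("right hand", "above the left hand"))
    return codes

def verbalize(codes):
    """Combine posecodes into one sentence with a simple syntactic template."""
    return "The " + ", and the ".join(f"{part} is {state}" for part, state in codes) + "."
```

For a pose with the left arm sharply folded and the right arm hanging straight down, `verbalize(extract_posecodes(kp))` would produce something like "The left elbow is bent, and the left hand is above the right hand." The real pipeline handles many more posecode categories and randomized phrasings to yield the 100k automatic captions.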
Related papers
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- ChatPose: Chatting about 3D Human Pose [47.70287492050979]
ChatPose is a framework to understand and reason about 3D human poses from images or textual descriptions.
Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Decanus to Legatus: Synthetic training for 2D-3D human pose lifting [26.108023246654646]
We propose an algorithm to generate infinite 3D synthetic human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus).
Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets but in a zero-shot setup, showing the potential of our framework.
arXiv Detail & Related papers (2022-10-05T13:10:19Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.