PoseScript: Linking 3D Human Poses and Natural Language
- URL: http://arxiv.org/abs/2210.11795v2
- Date: Fri, 19 Jan 2024 14:08:38 GMT
- Title: PoseScript: Linking 3D Human Poses and Natural Language
- Authors: Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc
Moreno-Noguer, Grégory Rogez
- Abstract summary: We introduce the PoseScript dataset, which pairs more than six thousand 3D human poses with rich human-annotated descriptions.
To increase the size of the dataset to a scale that is compatible with data-hungry learning algorithms, we have proposed an elaborate captioning process.
This process extracts low-level pose information, known as "posecodes", using a set of simple but generic rules on the 3D keypoints.
With automatic annotations, the amount of available data significantly scales up (100k), making it possible to effectively pretrain deep models for finetuning on human captions.
- Score: 33.325778872898866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language plays a critical role in many computer vision applications,
such as image captioning, visual question answering, and cross-modal retrieval,
to provide fine-grained semantic information. Unfortunately, while human pose
is key to human understanding, current 3D human pose datasets lack detailed
language descriptions. To address this issue, we have introduced the PoseScript
dataset. This dataset pairs more than six thousand 3D human poses from AMASS
with rich human-annotated descriptions of the body parts and their spatial
relationships. Additionally, to increase the size of the dataset to a scale
that is compatible with data-hungry learning algorithms, we have proposed an
elaborate captioning process that generates automatic synthetic descriptions in
natural language from given 3D keypoints. This process extracts low-level pose
information, known as "posecodes", using a set of simple but generic rules on
the 3D keypoints. These posecodes are then combined into higher level textual
descriptions using syntactic rules. With automatic annotations, the amount of
available data significantly scales up (100k), making it possible to
effectively pretrain deep models for finetuning on human captions. To showcase
the potential of annotated poses, we present three multi-modal learning tasks
that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps
3D poses and textual descriptions into a joint embedding space, allowing for
cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we
establish a baseline for a text-conditioned model generating 3D poses. Thirdly,
we present a learned process for generating pose descriptions. These
applications demonstrate the versatility and usefulness of annotated poses in
various tasks and pave the way for future research in the field.
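The automatic captioning pipeline described above has two stages: rule-based extraction of low-level "posecodes" from 3D keypoints, followed by syntactic combination into natural-language descriptions. The following is a minimal illustrative sketch of that idea, not the paper's actual implementation; the joint names, angle thresholds, rules, and templates are all assumptions chosen for the example.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by 3D keypoints a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def extract_posecodes(kp):
    """Apply simple, generic rules to 3D keypoints to obtain low-level posecodes.

    kp maps a joint name (hypothetical naming) to a 3D position with y pointing up.
    """
    codes = []
    # Rule: an elbow counts as "bent" when the shoulder-elbow-wrist angle
    # falls below an (assumed) threshold.
    for side in ("left", "right"):
        ang = joint_angle(kp[f"{side}_shoulder"], kp[f"{side}_elbow"], kp[f"{side}_wrist"])
        if ang < 120:
            codes.append((f"{side} elbow", "bent"))
    # Rule: relative height of the two wrists, with a small tolerance in metres.
    if kp["left_wrist"][1] > kp["right_wrist"][1] + 0.1:
        codes.append(("left hand", "above the right hand"))
    elif kp["right_wrist"][1] > kp["left_wrist"][1] + 0.1:
        codes.append(("right hand", "above the left hand"))
    return codes

def verbalize(codes):
    """Combine posecodes into one sentence with a simple syntactic template."""
    return "The " + ", and the ".join(f"{part} is {state}" for part, state in codes) + "."
```

For a pose with the left arm sharply folded and the right arm hanging straight down, `verbalize(extract_posecodes(kp))` would produce something like "The left elbow is bent, and the left hand is above the right hand." The real pipeline handles many more posecode categories and randomized phrasings to yield the 100k automatic captions.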
Related papers
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images [32.33170182669095]
We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images.
The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads.
The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks.
arXiv Detail & Related papers (2024-01-17T18:51:53Z)
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs [74.98581417902201]
We propose a novel framework to generate compositional 3D scenes from scene graphs.
By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model.
We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer.
arXiv Detail & Related papers (2023-11-30T18:59:58Z)
- ChatPose: Chatting about 3D Human Pose [47.70287492050979]
ChatPose is a framework to understand and reason about 3D human poses from images or textual descriptions.
Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Decanus to Legatus: Synthetic training for 2D-3D human pose lifting [26.108023246654646]
We propose an algorithm to generate infinite 3D synthetic human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus).
Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets but in a zero-shot setup, showing the potential of our framework.
arXiv Detail & Related papers (2022-10-05T13:10:19Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.