ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning
- URL: http://arxiv.org/abs/2405.04533v1
- Date: Tue, 7 May 2024 17:59:31 GMT
- Title: ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning
- Authors: Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black
- Abstract summary: ChatHuman is a language-driven human understanding system.
It combines and integrates the skills of many different methods.
ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful system for 3D human reasoning.
- Score: 57.29285473727107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system -- ChatHuman, which combines and integrates the skills of many different methods. To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful system for 3D human reasoning.
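The abstract only sketches the pipeline at a high level. Below is a minimal, hypothetical Python sketch of the retrieval-augmented tool-selection loop it hints at: paper-style tool descriptions form a retrieval corpus, the most relevant tools are retrieved for a query, and an LLM is asked to pick one. The tool names, the word-overlap retrieval, and the `call_llm` stub are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a retrieval-augmented tool-selection loop, loosely
# following the pipeline described in the ChatHuman abstract. All tools,
# names, and the scoring scheme are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str  # paper-abstract-style summary of what the tool does

    def run(self, image_path: str) -> dict:
        # Placeholder: a real tool would run a 3D-human model here.
        return {"tool": self.name, "result": f"output of {self.name} on {image_path}"}


# Paper-derived tool descriptions act as the retrieval corpus.
TOOLS = [
    Tool("pose_estimator", "estimates 3D body pose and shape from a single image"),
    Tool("contact_detector", "detects human-object and self contact regions"),
    Tool("emotion_recognizer", "recognizes facial expression and emotion"),
]


def retrieve_candidates(query: str, tools: list[Tool], k: int = 2) -> list[Tool]:
    """Rank tools by naive word overlap between the query and each tool's
    description (a stand-in for a real retrieval-augmented generation model)."""
    q = set(query.lower().split())
    scored = sorted(tools, key=lambda t: -len(q & set(t.description.lower().split())))
    return scored[:k]


def call_llm(prompt: str) -> str:
    """Stub for a finetuned LLM call; replace with a real model client."""
    # Here we simply return the first tool name that appears in the prompt.
    for tool in TOOLS:
        if tool.name in prompt:
            return tool.name
    return TOOLS[0].name


def chat_human(query: str, image_path: str) -> dict:
    candidates = retrieve_candidates(query, TOOLS)
    prompt = (
        f"User query: {query}\n"
        "Candidate tools (with paper-derived descriptions):\n"
        + "\n".join(f"- {t.name}: {t.description}" for t in candidates)
        + "\nSelect the single most appropriate tool."
    )
    chosen_name = call_llm(prompt)
    chosen = next(t for t in TOOLS if t.name == chosen_name)
    return chosen.run(image_path)


if __name__ == "__main__":
    print(chat_human("What is the 3D pose of the person in this photo?", "person.jpg"))
```

Per the abstract, the actual system additionally uses the tools' associated publications to guide tool use, generates in-context-learning examples for new tools via retrieval-augmented generation, and discriminates among and integrates the outputs of multiple tools rather than returning a single tool's result.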
Related papers
- DiverseDialogue: A Methodology for Designing Chatbots with Human-Like Diversity [5.388338680646657]
We show that conversations in which GPT-4o mini serves as the simulated human participant systematically differ from conversations between actual humans across multiple linguistic features.
We propose an approach that automatically generates prompts for user simulations by incorporating features derived from real human interactions.
Our method of prompt optimization, tailored to target specific linguistic features, shows significant improvements.
arXiv Detail & Related papers (2024-08-30T21:33:58Z) - Maia: A Real-time Non-Verbal Chat for Human-AI Interaction [11.558827428811385]
We propose an alternative to text chats for Human-AI interaction, using facial expressions and head movements that mirror, but also improvise over, those of the human user.
Our goal is to track and analyze facial expressions, and other non-verbal cues in real-time, and use this information to build models that can predict and understand human behavior.
arXiv Detail & Related papers (2024-02-09T13:07:22Z) - Primitive-based 3D Human-Object Interaction Modelling and Programming [59.47308081630886]
We propose a novel 3D geometric primitive-based language to encode both humans and objects.
We build a new benchmark on 3D HAOI consisting of primitives together with their images.
We believe this primitive-based 3D HAOI representation would pave the way for 3D HAOI studies.
arXiv Detail & Related papers (2023-12-17T13:16:49Z) - Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
arXiv Detail & Related papers (2023-11-09T13:01:21Z) - HODN: Disentangling Human-Object Feature for HOI Detection [51.48164941412871]
We propose a Human and Object Disentangling Network (HODN) to model the Human-Object Interaction (HOI) relationships explicitly.
Considering that human features are more contributive to interaction, we propose a Human-Guide Linking method to make sure the interaction decoder focuses on the human-centric regions.
Our proposed method achieves competitive performance on both the V-COCO and HICO-DET datasets.
arXiv Detail & Related papers (2023-08-20T04:12:50Z) - Deep Learning for Human Parsing: A Survey [54.812353922568995]
We provide an analysis of state-of-the-art human parsing methods, covering a broad spectrum of pioneering works for semantic human parsing.
We introduce several insightful categories: (1) structure-driven architectures exploit the relationships among different human parts and the inherent hierarchical structure of the human body, (2) graph-based networks capture global information to achieve an efficient and complete human body analysis, (3) context-aware networks explore useful contexts across all pixels to characterize each pixel's class, and (4) LSTM-based methods combine short-distance and long-distance spatial dependencies to better exploit abundant local and global contexts.
arXiv Detail & Related papers (2023-01-29T10:54:56Z) - iCub! Do you recognize what I am doing?: multimodal human action recognition on multisensory-enabled iCub robot [0.0]
We show that the proposed multimodal ensemble learning leverages complementary characteristics of three color cameras and one depth sensor.
The results indicate that the proposed models can be deployed on the iCub robot that requires multimodal action recognition.
arXiv Detail & Related papers (2022-12-17T12:40:54Z) - Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors [42.17542596399014]
We present a method for inferring diverse 3D models of human-object interactions from images.
Our method extracts high-level commonsense knowledge from large language models.
We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset.
arXiv Detail & Related papers (2022-09-06T13:32:55Z) - Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on 3DPW, an in-the-wild human video dataset.
arXiv Detail & Related papers (2021-11-29T16:32:41Z) - Human-robot co-manipulation of extended objects: Data-driven models and control from analysis of human-human dyads [2.7036498789349244]
We use data from human-human dyad experiments to determine motion intent, which we then use in a physical human-robot co-manipulation task.
We develop a deep neural network based on motion data from human-human trials to predict human intent based on past motion.
arXiv Detail & Related papers (2020-01-03T21:23:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.