Body of Her: A Preliminary Study on End-to-End Humanoid Agent
- URL: http://arxiv.org/abs/2408.02879v1
- Date: Tue, 6 Aug 2024 01:13:09 GMT
- Title: Body of Her: A Preliminary Study on End-to-End Humanoid Agent
- Authors: Tenglong Ao
- Abstract summary: We propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors.
This work performs a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research towards scaling up.
- Score: 0.8702432681310401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An interactive virtual humanoid agent is a crucial interface to the physical world. A relatively complete humanoid agent first needs to have a face and body, then possess both verbal and non-verbal abilities (such as eye contact, facial expression, lip motion, gesture, and manipulation), and finally be capable of real-time duplex communication, e.g., the ability to actively interrupt conversations. Most prior systems consider only a subset of these elements, leaving a gap from a realistic humanoid agent. In this work, we propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors, including speech, full-body movements for talking, responding, idling, and manipulation. The system is a multimodal model integrating audio and visual inputs, extended from a pre-trained large language model (LLM). We collect approximately 200,000 hours of audio, around 130,000 hours of video data, and about 20,000 alignment samples to build the model. The final model demonstrates capabilities that are difficult to achieve in previous systems, such as generalized object manipulation. This work performs a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research towards scaling up.
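The paper does not ship code; as a rough illustration of the duplex behavior it describes, the sketch below shows an agent loop that keeps encoding incoming audio/video chunks even while it is speaking, so the user can barge in at any time. All class and token names are hypothetical stand-ins, not the paper's interfaces.

```python
# Minimal duplex-loop sketch (hypothetical, not the paper's code): the agent
# ingests every incoming chunk, even mid-utterance, and a control token
# decides whether to stay silent or keep streaming speech/motion tokens.
from dataclasses import dataclass

@dataclass
class Chunk:
    audio: bytes   # e.g. ~100 ms of microphone audio
    frame: bytes   # one camera frame

class DummyModel:
    """Stand-in for the multimodal LM extended from a pretrained LLM."""
    def __init__(self):
        self.heard_speech = False

    def ingest(self, audio: bytes, frame: bytes) -> None:
        # Real system: encode audio/vision into tokens and append to context.
        self.heard_speech = len(audio) > 0

    def next_token(self) -> str:
        # Real system: autoregressive decoding of speech + motion tokens,
        # plus control tokens that implement the duplex behavior.
        return "<listen>" if self.heard_speech else "<speak:token>"

def duplex_loop(model, chunks):
    for chunk in chunks:
        model.ingest(chunk.audio, chunk.frame)   # keep listening while talking
        token = model.next_token()
        if token == "<listen>":
            continue                             # user is talking; stay silent
        yield token                              # stream speech/motion output

stream = [Chunk(b"", b"f0"), Chunk(b"hi", b"f1"), Chunk(b"", b"f2")]
print(list(duplex_loop(DummyModel(), stream)))   # ['<speak:token>', '<speak:token>']
```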
Related papers
- OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models [25.45077656291886]
We propose a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase.
These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation.
Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs.
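OmniHuman's code is not reproduced here; the snippet below is a hedged sketch of the condition-mixing idea named in the summary: each training sample may carry several motion-related conditions, and randomly keeping or dropping them lets weakly-annotated data join training. The keep probabilities are illustrative, not the paper's ratios.

```python
# Hedged sketch of mixed-condition training: stronger conditions (e.g. pose)
# are kept less often, so the model also learns from weaker signals.
import random

KEEP_PROB = {"text": 0.9, "audio": 0.5, "pose": 0.25}   # illustrative values

def mix_conditions(sample: dict) -> dict:
    """Return the condition subset used for this training step."""
    return {
        name: value
        for name, value in sample.items()
        if name in KEEP_PROB and random.random() < KEEP_PROB[name]
    }

sample = {"text": "a person waves", "audio": "<audio feats>", "pose": "<pose seq>"}
for _ in range(3):
    print(mix_conditions(sample))   # a different condition subset per step
```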
arXiv Detail & Related papers (2025-02-03T05:17:32Z) - Massively Multi-Person 3D Human Motion Forecasting with Scene Context [13.197408989895102]
We propose a scene-aware social transformer model (SAST) to forecast long-term (10 s) human motion.
We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information.
Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study.
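As a rough sketch of the architecture named in the summary (not the authors' code), the model below wraps a Transformer bottleneck with temporal-convolution encoder/decoder stages and attends jointly over motion and scene tokens; all dimensions are invented for the example.

```python
# Illustrative SAST-like model: conv encoder -> Transformer bottleneck over
# motion + scene tokens -> conv decoder. Shapes are made up for the sketch.
import torch
import torch.nn as nn

class SASTLikeModel(nn.Module):
    def __init__(self, d=128, joints=22):
        super().__init__()
        self.enc = nn.Conv1d(joints * 3, d, kernel_size=5, padding=2)   # motion -> features
        self.scene_proj = nn.Linear(256, d)                             # scene features -> d
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.bottleneck = nn.TransformerEncoder(layer, num_layers=2)
        self.dec = nn.Conv1d(d, joints * 3, kernel_size=5, padding=2)   # features -> motion

    def forward(self, motion, scene):
        # motion: (people, T, joints*3); scene: (people, S, 256)
        h = self.enc(motion.transpose(1, 2)).transpose(1, 2)    # (people, T, d)
        tokens = torch.cat([h, self.scene_proj(scene)], dim=1)  # motion + scene tokens
        h = self.bottleneck(tokens)[:, : motion.shape[1]]       # keep motion positions
        return self.dec(h.transpose(1, 2)).transpose(1, 2)      # future motion

model = SASTLikeModel()
out = model(torch.randn(4, 30, 66), torch.randn(4, 10, 256))
print(out.shape)   # torch.Size([4, 30, 66])
```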
arXiv Detail & Related papers (2024-09-18T17:58:51Z) - Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
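A hedged sketch of the described loop follows: an LLM decomposes the instruction, a grounded vision model with Syn2Real pose estimation resolves each step to an object pose, and the robot executes. The helper functions are hypothetical stand-ins, not Polaris APIs.

```python
# Hypothetical plan -> ground -> execute loop in the spirit of the summary.
def llm_plan(instruction: str, scene_desc: str) -> list[str]:
    # Stand-in for a GPT-4 call that decomposes the task into object-level steps.
    return [f"pick up the {instruction.split()[-1]}", "place it in the bin"]

def ground_object(step: str, image) -> dict:
    # Stand-in for grounded detection + Syn2Real pose estimation: the pose
    # estimator is trained on synthetic renders and applied to real images.
    return {"object": step.split()[-1], "pose_6d": [0.4, 0.1, 0.2, 0.0, 0.0, 1.57]}

def execute(pose: dict) -> None:
    print("moving gripper to", pose["pose_6d"], "for", pose["object"])

image = None   # camera frame placeholder
for step in llm_plan("hand me the mug", "a mug and a bowl on the table"):
    execute(ground_object(step, image))
```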
arXiv Detail & Related papers (2024-08-15T06:40:38Z) - Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
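To make the idea concrete, here is a minimal sketch (assumed terms, not the paper's losses) of how proxemics and physics could regularize a reconstruction: one term penalizes implausible interpersonal distance, another penalizes interpenetrating geometry. Weights and thresholds are invented.

```python
# Illustrative regularizers for closely interacting people; values are made up.
import numpy as np

def proxemics_loss(pelvis_a, pelvis_b, comfortable=(0.3, 1.2)):
    d = np.linalg.norm(pelvis_a - pelvis_b)
    lo, hi = comfortable
    return max(lo - d, 0.0) + max(d - hi, 0.0)   # penalize implausible spacing

def penetration_loss(verts_a, verts_b, eps=0.05):
    # Crude physics proxy: penalize vertex pairs closer than a contact margin.
    dists = np.linalg.norm(verts_a[:, None] - verts_b[None, :], axis=-1)
    return np.maximum(eps - dists, 0.0).sum()

pelvis_a, pelvis_b = np.array([0.0, 0.0, 0.0]), np.array([0.5, 0.0, 0.0])
verts_a, verts_b = np.random.rand(50, 3), np.random.rand(50, 3) + 0.4
total = 1.0 * proxemics_loss(pelvis_a, pelvis_b) + 0.1 * penetration_loss(verts_a, verts_b)
print(total)
```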
arXiv Detail & Related papers (2024-04-17T11:55:45Z) - InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce a novel controllable motion generation method, InterControl, to encourage synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the distance between joint pairs for human-wise interactions can be generated using an off-the-shelf Large Language Model.
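The snippet below is a generic sketch of distance-based guidance, not InterControl's exact procedure: the predicted motion is nudged by the gradient of a joint-pair distance loss, with pairs and target distances supplied in the spirit of the LLM-generated constraints.

```python
# Generic loss-guidance sketch: push chosen joint pairs toward target distances.
import torch

def distance_loss(motion, pairs):
    # motion: (T, J, 3); pairs: list of (joint_i, joint_j, target_distance)
    loss = motion.new_zeros(())
    for i, j, target in pairs:
        d = (motion[:, i] - motion[:, j]).norm(dim=-1)   # per-frame joint distance
        loss = loss + ((d - target) ** 2).mean()
    return loss

def guidance_step(motion, pairs, step_size=1.0):
    motion = motion.detach().requires_grad_(True)
    distance_loss(motion, pairs).backward()
    return (motion - step_size * motion.grad).detach()   # nudge toward the constraint

motion = torch.randn(60, 22, 3)        # a denoised motion estimate, 60 frames
pairs = [(20, 21, 0.0)]                # e.g. "the two hands touch", from an LLM
for _ in range(50):                    # in practice, applied inside the sampling loop
    motion = guidance_step(motion, pairs)
print(distance_loss(motion, pairs).item())   # loss decreases across guidance steps
```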
arXiv Detail & Related papers (2023-11-27T14:32:33Z) - Object Motion Guided Human Motion Synthesis [22.08240141115053]
We study the problem of full-body human motion synthesis for the manipulation of large-sized objects.
We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework.
We develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated.
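As a hedged illustration of the conditional setup (the stage split below is an assumption based on the summary, and the denoisers are stand-ins, not the real networks): object motion conditions hand positions, which in turn condition full-body motion.

```python
# Toy two-stage conditional generation: object trajectory -> hands -> body.
import torch

def denoise(x, cond, steps=4):
    # Stand-in for a diffusion sampler: drift noisy x toward the condition's mean.
    for _ in range(steps):
        x = 0.9 * x + 0.1 * cond.mean(dim=-1, keepdim=True).expand_as(x)
    return x

object_traj = torch.randn(120, 6)                       # object pose per frame
hands = denoise(torch.randn(120, 2 * 3), object_traj)   # stage 1: hand joints
body = denoise(torch.randn(120, 22 * 3), hands)         # stage 2: full body
print(body.shape)   # torch.Size([120, 66])
```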
arXiv Detail & Related papers (2023-09-28T08:22:00Z) - Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
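A minimal sketch of an implicit motion representation as the summary describes it: a small MLP maps a continuous temporal coordinate to a full-body pose, so the same motion can be queried at any frame rate. Network sizes are invented for the example.

```python
# Implicit motion representation: pose = f(t) for continuous t in [0, 1].
import torch
import torch.nn as nn

pose_dim = 66                                   # e.g. 22 joints x 3
net = nn.Sequential(nn.Linear(1, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, pose_dim))

t = torch.linspace(0, 1, 30).unsqueeze(-1)      # query 30 timestamps...
poses_30fps = net(t)
t_dense = torch.linspace(0, 1, 120).unsqueeze(-1)
poses_120fps = net(t_dense)                     # ...or the same motion, sampled denser
print(poses_30fps.shape, poses_120fps.shape)    # (30, 66) (120, 66)
```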
arXiv Detail & Related papers (2023-03-23T09:31:56Z) - Full-Body Articulated Human-Object Interaction [61.01135739641217]
CHAIRS is a large-scale motion-captured f-AHOI dataset consisting of 16.2 hours of versatile interactions.
CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process.
By learning the geometrical relationships in HOI, we devise the first model that leverages human pose estimation to estimate articulated object poses during interaction.
arXiv Detail & Related papers (2022-12-20T19:50:54Z) - Multi-Modal Open-Domain Dialogue [28.69395893943413]
Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling.
We investigate combining components from state-of-the-art open-domain dialogue agents with those from state-of-the-art vision models.
We show that our best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor.
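One common way to realize such a combination (assumed here, not the paper's exact design) is to project pretrained image features into the dialogue transformer's embedding space and prepend them as extra tokens:

```python
# Sketch of image-text fusion via a learned projection; dims are illustrative.
import torch
import torch.nn as nn

d_model = 512
image_feat = torch.randn(1, 2048)                 # pooled feature from a vision model
project = nn.Linear(2048, d_model)                # learned bridge into token space
image_token = project(image_feat).unsqueeze(1)    # (batch, 1, d_model)

text_tokens = torch.randn(1, 12, d_model)         # embedded dialogue history
fused = torch.cat([image_token, text_tokens], dim=1)
print(fused.shape)   # torch.Size([1, 13, 512]) -> fed to the dialogue decoder
```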
arXiv Detail & Related papers (2020-10-02T16:20:39Z) - Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
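A hedged sketch of one way to mix the two data sources: interaction samples carry ground-truth actions, while observational samples get a latent action filled in by an inference network; both then train the same dynamics model. Names and dimensions are invented, not taken from the paper.

```python
# Toy setup: robot clips have actions, human clips do not; infer the missing ones.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 4
infer_action = nn.Linear(2 * obs_dim, act_dim)     # guesses a_t from (o_t, o_t+1)
dynamics = nn.Linear(obs_dim + act_dim, obs_dim)   # predicts o_t+1 from (o_t, a_t)

def predict_next(o_t, o_next, action=None):
    if action is None:                             # observational (e.g. human) sample
        action = infer_action(torch.cat([o_t, o_next], dim=-1))
    return dynamics(torch.cat([o_t, action], dim=-1))

o_t, o_next = torch.randn(8, obs_dim), torch.randn(8, obs_dim)
robot_pred = predict_next(o_t, o_next, action=torch.randn(8, act_dim))
human_pred = predict_next(o_t, o_next)             # action treated as a latent
loss = ((robot_pred - o_next) ** 2).mean() + ((human_pred - o_next) ** 2).mean()
print(loss.item())
```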
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.