SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
- URL: http://arxiv.org/abs/2412.00174v1
- Date: Fri, 29 Nov 2024 18:53:40 GMT
- Title: SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
- Authors: Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu,
- Abstract summary: We introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters.
We propose a unified social VLA framework to generate multimodal response (speech and motion) based on the user's multimodal input to drive the character for social interaction.
We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets.
- Score: 38.90959051732146
- License:
- Abstract: Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet foundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal response (speech and motion) based on the user's multimodal input to drive the character for social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.
Related papers
- Social Conjuring: Multi-User Runtime Collaboration with AI in Building Virtual 3D Worlds [3.5152339192019113]
Social Conjurer is a framework for AI-augmented dynamic 3D scene co-creation.
This article presents a set of implications for designing human-centered interfaces that incorporate AI models into 3D content generation.
arXiv Detail & Related papers (2024-09-30T23:02:51Z) - Towards Embedding Dynamic Personas in Interactive Robots: Masquerading Animated Social Kinematics (MASK) [10.351714893090964]
This paper presents the design and development of an innovative interactive robotic system to enhance audience engagement using character-like personas.
Built upon the foundations of persona-driven dialog agents, this work extends the agent's application to the physical realm, employing robots to provide a more captivating and interactive experience.
arXiv Detail & Related papers (2024-03-15T06:22:32Z) - Digital Life Project: Autonomous 3D Characters with Social Intelligence [86.2845109451914]
Digital Life Project is a framework utilizing language as the universal medium to build autonomous 3D characters.
Our framework comprises two primary components: SocioMind and MoMat-MoGen.
arXiv Detail & Related papers (2023-12-07T18:58:59Z) - GenZI: Zero-Shot 3D Human-Scene Interaction Generation [39.9039943099911]
We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions.
Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions.
In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data.
arXiv Detail & Related papers (2023-11-29T15:40:11Z) - ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions [66.87211993793807]
We present ReMoS, a denoising diffusion based model that synthesizes full body motion of a person in two person interaction scenario.
We demonstrate ReMoS across challenging two person scenarios such as pair dancing, Ninjutsu, kickboxing, and acrobatics.
We also contribute the ReMoCap dataset for two person interactions containing full body and finger motions.
arXiv Detail & Related papers (2023-11-28T18:59:52Z) - Tachikuma: Understading Complex Interactions with Multi-Character and
Novel Objects by Large Language Models [67.20964015591262]
We introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation task and a supporting dataset.
The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.
We present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding.
arXiv Detail & Related papers (2023-07-24T07:40:59Z) - Triangular Character Animation Sampling with Motion, Emotion, and
Relation [78.80083186208712]
We present a novel framework to sample and synthesize animations by associating the characters' body motions, facial expressions, and social relations.
Our method can provide animators with an automatic way to generate 3D character animations, help synthesize interactions between Non-Player Characters (NPCs) and enhance machine emotion intelligence in virtual reality (VR)
arXiv Detail & Related papers (2022-03-09T18:19:03Z) - PHASE: PHysically-grounded Abstract Social Events for Machine Social
Perception [50.551003004553806]
We create a dataset of physically-grounded abstract social events, PHASE, that resemble a wide range of real-life social interactions.
Phase is validated with human experiments demonstrating that humans perceive rich interactions in the social events.
As a baseline model, we introduce a Bayesian inverse planning approach, SIMPLE, which outperforms state-of-the-art feed-forward neural networks.
arXiv Detail & Related papers (2021-03-02T18:44:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.