TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
- URL: http://arxiv.org/abs/2503.19901v2
- Date: Thu, 03 Apr 2025 15:28:08 GMT
- Title: TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
- Authors: Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, Jingbo Wang
- Abstract summary: TokenHSI is a transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token. The policy architecture supports variable-length inputs, enabling flexible adaptation of learned skills to new scenarios.
- Score: 41.224062790263375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: https://liangpan99.github.io/TokenHSI/
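The abstract's core mechanism, one shared proprioception token combined with per-task tokens through a masking mechanism, can be illustrated with a minimal sketch. This is not the authors' implementation; all names, dimensions, and the single-head attention layer are illustrative assumptions about how such token masking could work.

```python
import numpy as np

# Hypothetical sketch of TokenHSI-style task tokenization.
# The humanoid's proprioception is embedded as one shared token; each skill
# contributes its own task token; a binary mask selects which task tokens
# are active, so the same policy accepts variable-length token inputs.

rng = np.random.default_rng(0)
D = 8  # token embedding dimension (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(tokens, mask, Wq, Wk, Wv):
    """Single-head attention; masked-out tokens are never attended to."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(mask[None, :], scores, -1e9)  # hide inactive tokens
    return softmax(scores) @ V

# One shared proprioception token plus three task tokens
# (e.g. sit, carry, reach -- illustrative skill names).
prop_token = rng.standard_normal((1, D))
task_tokens = rng.standard_normal((3, D))
tokens = np.concatenate([prop_token, task_tokens])

# Mask: proprioception is always active; here only the second task is on.
mask = np.array([True, False, True, False])

Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = masked_self_attention(tokens, mask, Wq, Wk, Wv)

# The proprioception token's output mixes only active-token information;
# a small head on top of it would produce the humanoid's action.
action = out[0]
print(action.shape)  # (8,)
```

Because the mask is applied to attention scores rather than to the input length, new task tokenizers can be appended (or several activated at once to coordinate skills) without retraining the shared backbone, which is the flexibility the abstract describes.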
Related papers
- Geometrically-Aware One-Shot Skill Transfer of Category-Level Objects [18.978751760636563]
We propose a new skill transfer framework, which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration.
Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions.
We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.
arXiv Detail & Related papers (2025-03-19T16:10:17Z)
- Integrating Controllable Motion Skills from Demonstrations [30.943279225315308]
We introduce a flexible multi-skill integration framework named Controllable Skills Integration (CSI)
CSI enables the integration of a diverse set of motion skills with varying styles into a single policy without the need for complex reward tuning.
Our experiments demonstrate that CSI can flexibly integrate a diverse array of motion skills more comprehensively and facilitate the transitions between different skills.
arXiv Detail & Related papers (2024-08-06T08:01:02Z)
- Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning [61.294110816231886]
We introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP)
SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model.
Demos and code can be found at https://forrest-110.io/sparse_diffusion_policy/.
arXiv Detail & Related papers (2024-07-01T17:59:56Z)
- One-shot Imitation in a Non-Stationary Environment via Multi-Modal Skill [6.294766893350108]
We present a skill-based imitation learning framework enabling one-shot imitation and zero-shot adaptation.
We leverage a vision-language model to learn a semantic skill set from offline video datasets.
We evaluate our framework with various one-shot imitation scenarios for extended multi-stage Meta-world tasks.
arXiv Detail & Related papers (2024-02-13T11:01:52Z)
- Unified Human-Scene Interaction via Prompted Chain-of-Contacts [61.87652569413429]
Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality.
This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands.
arXiv Detail & Related papers (2023-09-14T17:59:49Z)
- MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks [63.016244188951696]
We propose MetaModulation, a method for few-shot learning with fewer tasks.
We modulate parameters at various batch levels to increase the number of meta-training tasks.
We also learn variational feature hierarchies by incorporating variational modulation.
arXiv Detail & Related papers (2023-05-17T15:47:47Z)
- A Transformer Framework for Data Fusion and Multi-Task Learning in Smart Cities [99.56635097352628]
This paper proposes a Transformer-based AI system for emerging smart cities.
It supports virtually any input data and output task types present in S&CCs (smart and connected communities).
It is demonstrated through learning diverse task sets representative of S&CC environments.
arXiv Detail & Related papers (2022-11-18T20:43:09Z)
- Active Task Randomization: Learning Robust Skills via Unsupervised Generation of Diverse and Feasible Tasks [37.73239471412444]
We introduce Active Task Randomization (ATR), an approach that learns robust skills through the unsupervised generation of training tasks.
ATR selects suitable tasks, which consist of an initial environment state and manipulation goal, for learning robust skills by balancing the diversity and feasibility of the tasks.
We demonstrate that the learned skills can be composed by a task planner to solve unseen sequential manipulation problems based on visual inputs.
arXiv Detail & Related papers (2022-11-11T11:24:55Z)
- Modular Adaptive Policy Selection for Multi-Task Imitation Learning through Task Division [60.232542918414985]
Multi-task learning often suffers from negative transfer, sharing information that should be task-specific.
The proposed approach uses proto-policies as modules to divide the tasks into simple sub-behaviours that can be shared.
We also demonstrate its ability to autonomously divide the tasks into both shared and task-specific sub-behaviours.
arXiv Detail & Related papers (2022-03-28T15:53:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.