InterFusion: Text-Driven Generation of 3D Human-Object Interaction
- URL: http://arxiv.org/abs/2403.15612v2
- Date: Tue, 16 Jul 2024 17:20:31 GMT
- Title: InterFusion: Text-Driven Generation of 3D Human-Object Interaction
- Authors: Sisi Dai, Wenhao Li, Haowen Sun, Haibin Huang, Chongyang Ma, Hui Huang, Kai Xu, Ruizhen Hu
- Abstract summary: We tackle the complex task of generating 3D human-object interactions (HOI) from textual descriptions in a zero-shot text-to-3D manner.
We present InterFusion, a two-stage framework specifically designed for HOI generation.
Our experimental results affirm that InterFusion significantly outperforms existing state-of-the-art methods in 3D HOI generation.
- Score: 38.380079482331745
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this study, we tackle the complex task of generating 3D human-object interactions (HOI) from textual descriptions in a zero-shot text-to-3D manner. We identify and address two key challenges: the unsatisfactory outcomes of direct text-to-3D methods in HOI, largely due to the lack of paired text-interaction data, and the inherent difficulties in simultaneously generating multiple concepts with complex spatial relationships. To effectively address these issues, we present InterFusion, a two-stage framework specifically designed for HOI generation. InterFusion involves human pose estimations derived from text as geometric priors, which simplifies the text-to-3D conversion process and introduces additional constraints for accurate object generation. At the first stage, InterFusion extracts 3D human poses from a synthesized image dataset depicting a wide range of interactions, subsequently mapping these poses to interaction descriptions. The second stage of InterFusion capitalizes on the latest developments in text-to-3D generation, enabling the production of realistic and high-quality 3D HOI scenes. This is achieved through a local-global optimization process, where the generation of human body and object is optimized separately, and jointly refined with a global optimization of the entire scene, ensuring a seamless and contextually coherent integration. Our experimental results affirm that InterFusion significantly outperforms existing state-of-the-art methods in 3D HOI generation.
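The local-global optimization described in the abstract can be illustrated with a minimal, self-contained sketch (not the authors' code): toy scalar "human" and "object" parameters are first fit to their own objectives, then jointly refined under a scene-level coupling term. The targets, losses, and the offset constraint here are made-up stand-ins for the actual per-concept and scene-wide generation losses.

```python
# Illustrative sketch of a local-global optimization scheme: optimize
# each part separately, then refine both jointly with a coupling term.
# All objectives are toy quadratics, not the paper's actual losses.

def grad_descent(params, grad_fn, lr=0.1, steps=200):
    """Plain gradient descent on a list of floats."""
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Local stage: fit the human and the object independently.
HUMAN_TARGET, OBJECT_TARGET = 1.0, 3.0
human = grad_descent([0.0], lambda p: [2 * (p[0] - HUMAN_TARGET)])
obj = grad_descent([0.0], lambda p: [2 * (p[0] - OBJECT_TARGET)])

# Global stage: jointly refine both so a scene constraint (here a fixed
# human-object offset, standing in for contact/spatial consistency)
# is also satisfied, while keeping each part close to its own target.
DESIRED_OFFSET = 2.0

def joint_grads(p):
    h, o = p
    # gradient of (h - t_h)^2 + (o - t_o)^2 + ((o - h) - offset)^2
    gh = 2 * (h - HUMAN_TARGET) - 2 * ((o - h) - DESIRED_OFFSET)
    go = 2 * (o - OBJECT_TARGET) + 2 * ((o - h) - DESIRED_OFFSET)
    return [gh, go]

human_refined, object_refined = grad_descent(human + obj, joint_grads)
```

With these toy targets the joint minimum keeps both fidelity terms and the offset constraint satisfied exactly; in the real framework the global stage trades the two off to keep the composed scene coherent.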
Related papers
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.
We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - Towards Semantic 3D Hand-Object Interaction Generation via Functional Text Guidance [9.630837159704004]
Hand-object interaction (HOI) is the fundamental link between humans and their environment.
Despite advances in AI and robotics, capturing the semantics of functional grasping tasks remains a considerable challenge.
We propose an innovative two-stage framework, Functional Grasp Synthesis Net (FGS-Net), for generating 3D HOI driven by functional text.
arXiv Detail & Related papers (2025-02-28T07:42:54Z) - OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains [66.62502882481373]
Current methods tend to focus either on the body or the hands, which limits their ability to produce cohesive and realistic interactions.
We propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions.
Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism which includes semantic adjustment and geometry deformation to improve robustness.
arXiv Detail & Related papers (2024-11-27T10:13:35Z) - BIFRÖST: 3D-Aware Image compositing with Language Instructions [27.484947109237964]
Bifröst is a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition.
Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process.
arXiv Detail & Related papers (2024-10-24T18:35:12Z) - HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z) - Interactive3D: Create What You Want by Interactive 3D Generation [13.003964182554572]
We introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process.
Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation.
arXiv Detail & Related papers (2024-04-25T11:06:57Z) - SERF: Fine-Grained Interactive 3D Segmentation and Editing with Radiance Fields [92.14328581392633]
We introduce a novel fine-grained interactive 3D segmentation and editing algorithm with radiance fields, which we refer to as SERF.
Our method entails creating a neural mesh representation by integrating multi-view algorithms with pre-trained 2D models.
Building upon this representation, we introduce a novel surface rendering technique that preserves local information and is robust to deformation.
arXiv Detail & Related papers (2023-12-26T02:50:42Z) - SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z) - HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models [42.62823339416957]
We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts.
We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text.
We also develop an affordance prediction diffusion model (APDM) to predict the contacting area between the human and object.
arXiv Detail & Related papers (2023-12-11T17:41:17Z) - JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery [84.67823511418334]
This paper presents a 3D JOint contrastive learning with TRansformers (JOTR) framework for handling occluded 3D human mesh recovery.
Our method includes an encoder-decoder transformer architecture to fuse 2D and 3D representations, achieving 2D & 3D aligned results.
arXiv Detail & Related papers (2023-07-31T02:58:58Z) - ATT3D: Amortized Text-to-3D Object Synthesis [78.96673650638365]
We amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately.
Our framework - Amortized text-to-3D (ATT3D) - enables knowledge-sharing between prompts to generalize to unseen setups and smooths between text for novel assets and simple animations.
arXiv Detail & Related papers (2023-06-06T17:59:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.