Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene
- URL: http://arxiv.org/abs/2507.19232v1
- Date: Fri, 25 Jul 2025 12:57:05 GMT
- Title: Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene
- Authors: Donggeun Lim, Jinseok Bae, Inwoo Hwang, Seungmin Lee, Hwanhee Lee, Young Min Kim
- Abstract summary: We propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. We harness the power of a large language model (LLM) to digest the contextual complexity of the textual input. We employ a high-level module to deliver scalable yet comprehensive context.
- Score: 13.70771642812974
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over the dynamic relationships among human-human and human-scene interactions. We harness the power of a large language model (LLM) to digest the contextual complexity of the textual input and convert the task into tangible subproblems, allowing us to generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene as a sequence of small events. Each event calls for a well-defined motion involving the relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability. The code and benchmark, along with result videos, are available at our project page: https://rms0329.github.io/Event-Driven-Storytelling/.
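The abstract outlines a three-step pipeline: an LLM-based event generator breaks the story into small events, a high-level module grounds each event into concrete character positions, and a motion synthesizer animates the characters. The sketch below illustrates that flow only; every name (Event, plan_events, ground_event, scene.sample_positions, motion_model.synthesize) and prompt wording are assumptions made for illustration, not the authors' released code (see the project page for the actual implementation).

```python
# Minimal sketch of the event-driven storytelling pipeline described above.
# All names, prompts, and interfaces are illustrative assumptions, not the
# authors' released code.
from dataclasses import dataclass


@dataclass
class Event:
    """One small unit of the story: who does what, with which objects."""
    description: str        # e.g. "Alice hands the cup to Bob at the table"
    characters: list[str]   # characters involved in this event
    objects: list[str]      # scene objects involved in this event


def plan_events(story: str, llm) -> list[Event]:
    """Event generator: the LLM digests the story and emits an ordered event list.
    Assumes the LLM answers one event per line as 'description | characters | objects'."""
    prompt = ("Break this story into small, ordered events, one per line, "
              "formatted as 'description | characters | objects':\n" + story)
    events = []
    for line in llm(prompt).strip().splitlines():
        desc, chars, objs = (part.strip() for part in line.split("|"))
        events.append(Event(desc, chars.split(", "), objs.split(", ")))
    return events


def ground_event(event: Event, scene, llm) -> dict:
    """High-level module: translate the event into relative spatial descriptions,
    then retrieve precise coordinates for each character from the scene."""
    prompt = (f"For the event '{event.description}', describe where each of "
              f"{event.characters} should stand relative to {event.objects}.")
    relative_descriptions = llm(prompt)
    # scene.sample_positions is an assumed retrieval step: relative text -> (x, y, z) per character
    return scene.sample_positions(relative_descriptions, event.characters)


def tell_story(story: str, scene, llm, motion_model) -> list:
    """Full pipeline: events -> grounded positions -> per-character motion clips."""
    clips = []
    for event in plan_events(story, llm):
        positions = ground_event(event, scene, llm)
        for name in event.characters:
            # motion_model.synthesize is an assumed text- and position-conditioned generator
            clips.append(motion_model.synthesize(event.description, positions[name]))
    return clips
```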
Related papers
- Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras [6.174442475414146]
We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. We provide over 30,000 validated referring expressions, each enriched with four grounding attributes. We propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations.
arXiv Detail & Related papers (2025-07-23T16:29:52Z)
- SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control [36.22743674288336]
SCENIC is a diffusion model designed to generate human motion that adapts to dynamic terrains within virtual scenes. Our system achieves seamless transitions between different motion styles while maintaining scene constraints. Our code, dataset, and models will be released at https://virtualhumans.mpi-inf.mpg.de/scenic/.
arXiv Detail & Related papers (2024-12-20T08:25:15Z)
- SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation [38.96874874208242]
We introduce a novel hierarchical framework named SIMS that seamlessly bridges high-level script-driven intent with a low-level control policy. Specifically, we employ Large Language Models with Retrieval-Augmented Generation to generate coherent and diverse long-form scripts. A versatile multi-condition physics-based control policy is also developed, which leverages text embeddings from the generated scripts to encode stylistic cues.
arXiv Detail & Related papers (2024-11-29T18:36:15Z)
- HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects [86.86284624825356]
HIMO is a dataset of full-body humans interacting with multiple objects.
HIMO contains 3.3K 4D HOI sequences and 4.08M 3D HOI frames.
arXiv Detail & Related papers (2024-07-17T07:47:34Z)
- Generating Human Motion in 3D Scenes from Text Descriptions [60.04976442328767]
This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions.
We propose a new approach that decomposes the complex problem into two more manageable sub-problems.
For language grounding of the target object, we leverage the power of large language models; for motion generation, we design an object-centric scene representation.
arXiv Detail & Related papers (2024-05-13T14:30:12Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- Revisit Human-Scene Interaction via Space Occupancy [55.67657438543008]
Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks.
In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective.
By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database.
arXiv Detail & Related papers (2023-12-05T12:03:00Z)
- AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism [24.049207982022214]
We propose AttT2M, a two-stage method with a multi-perspective attention mechanism.
Our method outperforms the current state-of-the-art in terms of qualitative and quantitative evaluation.
arXiv Detail & Related papers (2023-09-02T02:18:17Z)
- Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
- TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions (see the sketch after this list).
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
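The TEMOS entry above describes a text-conditioned variational autoencoder trained on human motion data. The sketch below illustrates only that general idea; the layer sizes, module names, and the GRU backbone are assumptions made for this example and are not taken from the TEMOS paper.

```python
# Illustrative text-conditioned motion VAE in the spirit of the TEMOS entry above.
# Architecture choices and names are assumptions for this sketch, not TEMOS itself.
import torch
import torch.nn as nn


class TextConditionedMotionVAE(nn.Module):
    def __init__(self, pose_dim=66, text_dim=512, latent_dim=256):
        super().__init__()
        # Encode a motion sequence (T x pose_dim) into a latent Gaussian.
        self.motion_encoder = nn.GRU(pose_dim, latent_dim, batch_first=True)
        self.to_mu = nn.Linear(latent_dim, latent_dim)
        self.to_logvar = nn.Linear(latent_dim, latent_dim)
        # Decode a latent code plus a text embedding back into a motion sequence.
        self.decoder = nn.GRU(latent_dim + text_dim, latent_dim, batch_first=True)
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def forward(self, motion, text_emb):
        # motion: (B, T, pose_dim); text_emb: (B, text_dim) from any frozen text encoder.
        _, h = self.motion_encoder(motion)               # h: (1, B, latent_dim)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        T = motion.shape[1]
        cond = torch.cat([z, text_emb], dim=-1).unsqueeze(1).repeat(1, T, 1)
        out, _ = self.decoder(cond)
        # Return the reconstruction plus the Gaussian parameters for the KL term of the loss.
        return self.to_pose(out), mu, logvar
```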
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.