Natural-language-driven Simulation Benchmark and Copilot for Efficient
Production of Object Interactions in Virtual Road Scenes
- URL: http://arxiv.org/abs/2312.04008v4
- Date: Fri, 15 Dec 2023 12:06:36 GMT
- Title: Natural-language-driven Simulation Benchmark and Copilot for Efficient
Production of Object Interactions in Virtual Road Scenes
- Authors: Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Die Zuo, Jibin
Peng, Zhao Huang, Zhecheng Xu, Fupeng Li, Ziyun Bai, Di Lin
- Abstract summary: We advocate the idea of natural-language-driven (NLD) simulation to efficiently produce interactions between multiple objects in virtual road scenes.
We collect the Language-to-Interaction (L2I) benchmark dataset with 120,000 natural-language descriptions of object interactions in 6 common types of road topologies.
As a methodology contribution, we design SimCopilot to translate the interaction descriptions into renderable code.
- Score: 8.303084278117861
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We advocate the idea of natural-language-driven (NLD) simulation to
efficiently produce interactions between multiple objects in virtual road
scenes, for teaching and testing autonomous driving systems that must take
quick action to avoid colliding with obstacles that move unpredictably. NLD
simulation allows a brief natural-language description to control the object
interactions, significantly reducing the human effort needed to create large
amounts of interaction data. To facilitate research on NLD simulation, we
collect the Language-to-Interaction (L2I) benchmark dataset with 120,000
natural-language descriptions of object interactions in 6 common types of
road topologies. Each description is paired with programming code, which a
graphics renderer can use to visually reconstruct the object interactions in
the virtual scenes. As a methodology contribution, we design SimCopilot to
translate the interaction descriptions into renderable code. We use the L2I
dataset to evaluate SimCopilot's ability to control object motions, generate
complex interactions, and generalize interactions across road topologies. The
L2I dataset and the evaluation results motivate further research on NLD
simulation.
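To make the L2I format concrete, the following is a hedged sketch of what a description-to-code pair might look like. The Scenario and Agent classes and the event schema are illustrative assumptions; the abstract does not specify the dataset's actual code interface.

```python
# Hypothetical sketch of an L2I-style description-to-code pair. The Scenario,
# Agent, and event schema below are illustrative stand-ins, not the dataset's
# actual interface.
from dataclasses import dataclass, field

@dataclass
class Agent:
    kind: str      # e.g. "car", "pedestrian"
    start: tuple   # (x, y) position in scene coordinates
    speed: float   # meters per second

@dataclass
class Scenario:
    topology: str                                # e.g. "crossroad"
    agents: list = field(default_factory=list)
    events: list = field(default_factory=list)   # (time_s, agent_index, action)

description = ("A car approaches the crossroad while a pedestrian suddenly "
               "steps onto the crossing, forcing the car to brake.")

# Code a SimCopilot-style translator might emit for the description above;
# a graphics renderer would consume this structure to replay the interaction.
scenario = Scenario(topology="crossroad")
scenario.agents.append(Agent("car", start=(-40.0, 0.0), speed=12.0))
scenario.agents.append(Agent("pedestrian", start=(0.0, -5.0), speed=1.5))
scenario.events.append((2.0, 1, "enter_crossing"))
scenario.events.append((2.5, 0, "brake"))
print(scenario)
```

Pairing descriptions with executable code means a candidate translation can, in principle, be checked by rendering it rather than by comparing strings.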
Related papers
- LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control [45.1230495980299]
We extend interactive object reconstruction from the single-object level to the complex-scene level.
We propose LiveScene, the first scene-level language-embedded interactive neural radiance field.
LiveScene efficiently reconstructs and controls multiple interactive objects in complex scenes.
arXiv Detail & Related papers (2024-06-23T07:26:13Z)
- Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
This study focuses on the application of Multimodal Large Language Models (MLLMs) within the domain of autonomous driving.
We evaluate the capability of various MLLMs as world models for driving from the perspective of a fixed in-car camera.
Our results highlight a critical gap in the current capabilities of state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
- Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
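As a sketch of the conditioning described in this entry: a language description sets style and intent, while sparse waypoints ground the object's path in the scene. All names, tensor shapes, and the stub synthesizer below are assumptions for illustration, not CHOIS's actual interface.

```python
# Illustrative sketch of CHOIS-style conditioning: a language description plus
# sparse waypoints (e.g. from a path planner) that ground the motion in the
# scene. Names, shapes, and the stub model are assumptions, not the paper's API.
import torch

text = "pick up the chair and carry it to the corner of the room"
# A few object waypoints in scene coordinates: rows of (t_seconds, x, y, z)
waypoints = torch.tensor([
    [0.0, 1.0, 2.0, 0.4],
    [2.0, 2.5, 3.0, 0.9],   # object lifted while being carried
    [4.0, 4.0, 4.5, 0.4],
])

def synthesize_interaction(text, waypoints, horizon=120):
    """Stand-in for a conditional generative model that would return
    synchronized human motion and an object 6-DoF trajectory."""
    human_motion = torch.zeros(horizon, 24, 3)   # e.g. 24 joints, xyz each
    object_motion = torch.zeros(horizon, 7)      # translation + quaternion
    return human_motion, object_motion

human, obj = synthesize_interaction(text, waypoints)
print(human.shape, obj.shape)  # torch.Size([120, 24, 3]) torch.Size([120, 7])
```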
- Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter [14.489086924126253]
This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes.
Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated on private datasets or simulators that do not capture the complexity of natural indoor scenes.
We propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs.
arXiv Detail & Related papers (2023-11-09T22:55:10Z)
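A minimal sketch of the end-to-end idea, assuming precomputed CLIP-style image and text features; the fusion head and the 5-parameter planar grasp are illustrative choices, not CROG's actual architecture.

```python
# Minimal sketch of the end-to-end idea behind CROG: fuse CLIP-style image and
# text features and regress a grasp directly, instead of segmenting first and
# grasping second. Encoders are stubbed as random features; the fusion head
# and 5-parameter planar grasp are illustrative assumptions.
import torch
import torch.nn as nn

class ReferringGraspHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.grasp = nn.Linear(dim, 5)   # (x, y, angle, width, quality)

    def forward(self, img_feat, txt_feat):
        fused = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        return self.grasp(fused)

img_feat = torch.randn(1, 512)   # would come from CLIP's image encoder
txt_feat = torch.randn(1, 512)   # would come from CLIP's text encoder
print(ReferringGraspHead()(img_feat, txt_feat).shape)  # torch.Size([1, 5])
```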
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality, as well as robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
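One common construction for an SE(3)-equivariant descriptor field, sketched below under stated assumptions, is to express the query point in the object's local frame before featurizing, so the descriptor rigidly follows the object; the MLP stand-in is hypothetical, not ROAM's learned field.

```python
# Sketch of one common SE(3)-equivariant construction: featurize the query
# point in the object's local frame, so the descriptor field rigidly follows
# the object. The MLP stand-in is hypothetical, not ROAM's learned field.
import torch

def descriptor(query_xyz, obj_rotation, obj_translation, mlp):
    # express the world-frame query in the object's local frame: R^T (x - t)
    local = (query_xyz - obj_translation) @ obj_rotation
    return mlp(local)

mlp = torch.nn.Sequential(
    torch.nn.Linear(3, 32), torch.nn.ReLU(), torch.nn.Linear(32, 16))
R = torch.eye(3)                     # object orientation (rotation matrix)
t = torch.tensor([1.0, 0.0, 0.5])    # object position
q = torch.tensor([[1.5, 0.2, 0.9]])  # world-frame query point
print(descriptor(q, R, t, mlp).shape)  # torch.Size([1, 16])
```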
- Learning Sim-to-Real Dense Object Descriptors for Robotic Manipulation [4.7246285569677315]
We present Sim-to-Real Dense Object Nets (SRDONs), a dense object descriptor that not only understands the object via appropriate representation but also maps simulated and real data to a unified feature space with pixel consistency.
We demonstrate in experiments that pre-trained SRDONs significantly improve performance on unseen objects and unseen visual environments for various robotic tasks with zero real-world training.
arXiv Detail & Related papers (2023-04-18T02:28:55Z)
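The pixel-consistency idea can be illustrated with a contrastive-style loss that pulls descriptors of corresponding sim/real pixels together and pushes mismatched pairs apart; the sketch below is a generic version of the technique, not SRDON's exact objective.

```python
# Generic sketch of the sim-to-real pixel-consistency idea: descriptors of
# corresponding pixels in a simulated and a real image should match, while
# mismatched pairs are pushed apart. This illustrates the technique, not
# SRDON's exact objective.
import torch
import torch.nn.functional as F

def pixel_consistency_loss(desc_sim, desc_real, matches, margin=0.5):
    """desc_*: (H, W, D) descriptor maps; matches: (N, 4) integer rows of
    (u_sim, v_sim, u_real, v_real), known from simulation ground truth."""
    a = desc_sim[matches[:, 0], matches[:, 1]]    # (N, D) sim descriptors
    b = desc_real[matches[:, 2], matches[:, 3]]   # (N, D) real descriptors
    pull = (a - b).pow(2).sum(-1).mean()
    # shuffled pairs serve as negatives, kept at least `margin` apart
    push = F.relu(margin - (a - b[torch.randperm(len(b))]).norm(dim=-1)).mean()
    return pull + push

desc_sim, desc_real = torch.randn(64, 64, 32), torch.randn(64, 64, 32)
matches = torch.randint(0, 64, (128, 4))
print(pixel_consistency_loss(desc_sim, desc_real, matches))
```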
- VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors of interaction-based models.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
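One way to read "virtual interaction" is as attention distillation: a bi-encoder's after-the-fact cross-attention is trained to match a cross-encoder teacher's. The sketch below is a hedged illustration under that reading; shapes and the squared-error loss are assumptions.

```python
# Hedged illustration of "virtual interaction" as attention distillation: the
# bi-encoder student's after-the-fact cross-attention is trained to match a
# cross-encoder teacher's. Shapes and the squared-error loss are assumptions.
import torch

def virtual_interaction_loss(q_states, d_states, teacher_attn):
    """q_states: (Lq, D) and d_states: (Ld, D) come from two *separate*
    encoders; teacher_attn: (Lq, Ld) from an interaction-based teacher."""
    scale = q_states.shape[-1] ** 0.5
    # "virtual" cross-attention computed only after independent encoding
    student_attn = torch.softmax(q_states @ d_states.T / scale, dim=-1)
    return ((student_attn - teacher_attn) ** 2).mean()

q = torch.randn(8, 64)                               # query token states
d = torch.randn(12, 64)                              # document token states
teacher = torch.softmax(torch.randn(8, 12), dim=-1)  # teacher attention
print(virtual_interaction_loss(q, d, teacher))
```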
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
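A minimal sketch of how a POMDP-style loop can stitch such modules together: maintain a belief over the referred object, ask a clarifying question when the belief is too uncertain, and grasp otherwise. The module functions below are stubs, not INVIGORATE's trained networks.

```python
# Sketch of how a POMDP-style controller can integrate separately trained
# modules (detection, grounding, question generation, grasping). The module
# functions are stubs; INVIGORATE's actual belief tracking is more involved.
def detect(image):           return ["mug", "bottle", "mug"]    # object labels
def ground(query, objects):  return [0.5 if query in o else 0.25 for o in objects]
def ask_question(belief):    return "Do you mean the mug on the left?"
def grasp(target_idx):       return f"grasping object {target_idx}"

def step(image, query, confidence_threshold=0.8):
    objects = detect(image)
    scores = ground(query, objects)
    belief = [s / sum(scores) for s in scores]   # belief over referred object
    if max(belief) < confidence_threshold:
        # asking a question is the action expected to reduce uncertainty most
        return "ask", ask_question(belief), belief
    return "grasp", grasp(belief.index(max(belief))), belief

action, result, belief = step(image=None, query="mug")
print(action, result, belief)
```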
- DriveGAN: Towards a Controllable High-Quality Neural Simulation [147.6822288981004]
We introduce a novel high-quality neural simulator referred to as DriveGAN.
DriveGAN achieves controllability by disentangling different components without supervision.
We train DriveGAN on multiple datasets, including 160 hours of real-world driving data.
arXiv Detail & Related papers (2021-04-30T15:30:05Z)