MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
- URL: http://arxiv.org/abs/2410.00253v1
- Date: Mon, 30 Sep 2024 21:51:30 GMT
- Title: MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
- Authors: Anna Deichler, Jim O'Regan, Jonas Beskow
- Abstract summary: We present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR).
Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings.
- Score: 4.098892268127572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR). Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings. Participants engaged in various conversational scenarios, all based on referential communication tasks. The dataset provides a rich set of multimodal recordings such as motion capture, speech, gaze, and scene graphs. This comprehensive dataset aims to enhance the understanding and development of gesture generation models in 3D scenes by providing diverse and contextually rich data.
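The abstract describes synchronized multimodal recordings (motion capture, speech, gaze, scene graphs). A minimal sketch of how such streams could be aligned onto a common clock by nearest-neighbour timestamp matching is shown below; all names, sample rates, and data layouts here are illustrative assumptions, not the dataset's actual format or API.

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class Sample:
    t: float       # timestamp in seconds
    payload: dict  # e.g. joint rotations, gaze direction, or a word

def nearest(stream: list, t: float) -> Sample:
    """Return the sample in a time-sorted stream closest to time t."""
    i = bisect_left([s.t for s in stream], t)
    candidates = stream[max(0, i - 1):i + 1]
    return min(candidates, key=lambda s: abs(s.t - t))

def align(mocap: list, gaze: list) -> list:
    """Resample the gaze stream onto the motion-capture clock."""
    return [(m, nearest(gaze, m.t)) for m in mocap]

# Hypothetical rates: 30 fps motion capture, 60 Hz gaze, 3 s of data.
mocap = [Sample(i / 30.0, {"joints": []}) for i in range(90)]
gaze = [Sample(i / 60.0, {"dir": (0.0, 0.0, 1.0)}) for i in range(180)]
pairs = align(mocap, gaze)
print(len(pairs))  # one gaze sample per motion-capture frame
```

Nearest-neighbour matching is the simplest alignment strategy; interpolating between the two bracketing gaze samples would give smoother signals at the cost of slightly more code.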
Related papers
- Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI.
We introduce an expertly curated dataset in the Universal Scene Description (USD) format featuring high-quality manual annotations.
With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models.
arXiv Detail & Related papers (2024-12-02T11:33:55Z)
- SIMS: Simulating Human-Scene Interactions with Real World Script Planning [33.31213669502036]
This paper introduces a novel framework for planning and controlling long-horizon, physically plausible human-scene interactions.
Large Language Models (LLMs) can understand and generate logical storylines.
By leveraging this, we utilize a dual-aware policy that achieves both language comprehension and scene understanding.
arXiv Detail & Related papers (2024-11-29T18:36:15Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545.
arXiv Detail & Related papers (2024-06-14T10:23:53Z) - Towards Open Domain Text-Driven Synthesis of Multi-Person Motions [36.737740727883924]
We curate human pose and motion datasets by estimating pose information from large-scale image and video datasets.
Our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
arXiv Detail & Related papers (2024-05-28T18:00:06Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data samples.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z)
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K).
These datasets incorporate both visual and text-based inputs and outputs.
To facilitate the accountability of multimodal systems in rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations [9.626560177660634]
We present SIMMC 2.0, a new corpus for Situated and Interactive Multimodal Conversations, aimed at building a successful multimodal assistant agent.
The dataset features 11K task-oriented dialogs (117K utterances) between a user and a virtual assistant in the shopping domain.
arXiv Detail & Related papers (2021-04-18T00:14:29Z)
- Situated and Interactive Multimodal Conversations [21.391260370502224]
We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents.
We provide two SIMMC datasets totalling 13K human-human dialogs (169K utterances), collected using a multimodal Wizard-of-Oz (WoZ) setup.
We present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation.
arXiv Detail & Related papers (2020-06-02T09:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.