MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
- URL: http://arxiv.org/abs/2410.00253v1
- Date: Mon, 30 Sep 2024 21:51:30 GMT
- Title: MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
- Authors: Anna Deichler, Jim O'Regan, Jonas Beskow
- Abstract summary: We present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR).
Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings.
- Score: 4.098892268127572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR). Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings. Participants engaged in various conversational scenarios, all based on referential communication tasks. The dataset provides a rich set of multimodal recordings such as motion capture, speech, gaze, and scene graphs. This comprehensive dataset aims to enhance the understanding and development of gesture generation models in 3D scenes by providing diverse and contextually rich data.
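The abstract describes synchronized multimodal recordings (motion capture, speech, gaze, scene graphs). A minimal sketch of how such streams could be aligned onto a common clock by nearest-neighbour timestamp matching is shown below; all names, sample rates, and data layouts here are illustrative assumptions, not the dataset's actual format or API.

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class Sample:
    t: float       # timestamp in seconds
    payload: dict  # e.g. joint rotations, gaze direction, or a word

def nearest(stream: list, t: float) -> Sample:
    """Return the sample in a time-sorted stream closest to time t."""
    i = bisect_left([s.t for s in stream], t)
    candidates = stream[max(0, i - 1):i + 1]
    return min(candidates, key=lambda s: abs(s.t - t))

def align(mocap: list, gaze: list) -> list:
    """Resample the gaze stream onto the motion-capture clock."""
    return [(m, nearest(gaze, m.t)) for m in mocap]

# Hypothetical rates: 30 fps motion capture, 60 Hz gaze, 3 s of data.
mocap = [Sample(i / 30.0, {"joints": []}) for i in range(90)]
gaze = [Sample(i / 60.0, {"dir": (0.0, 0.0, 1.0)}) for i in range(180)]
pairs = align(mocap, gaze)
print(len(pairs))  # one gaze sample per motion-capture frame
```

Nearest-neighbour matching is the simplest alignment strategy; interpolating between the two bracketing gaze samples would give smoother signals at the cost of slightly more code.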
Related papers
- Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI.
We introduce an expertly curated dataset in the Universal Scene Description (USD) format featuring high-quality manual annotations.
With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models.
arXiv Detail & Related papers (2024-12-02T11:33:55Z)
- SIMS: Simulating Human-Scene Interactions with Real World Script Planning [33.31213669502036]
This paper introduces a novel framework for planning and controlling long-horizon, physically plausible human-scene interactions.
Large Language Models (LLMs) can understand and generate logical storylines.
By leveraging this, we utilize a dual-aware policy that achieves both language comprehension and scene understanding.
arXiv Detail & Related papers (2024-11-29T18:36:15Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545.
arXiv Detail & Related papers (2024-06-14T10:23:53Z) - Towards Open Domain Text-Driven Synthesis of Multi-Person Motions [36.737740727883924]
We curate human pose and motion datasets by estimating pose information from large-scale image and video datasets.
Our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.
arXiv Detail & Related papers (2024-05-28T18:00:06Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data samples.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z)
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K).
These datasets incorporate both visual and text-based inputs and outputs.
To facilitate the accountability of multimodal systems in rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations [9.626560177660634]
We present SIMMC 2.0, a new corpus for Situated and Interactive Multimodal Conversations, aimed at building a successful multimodal assistant agent.
The dataset features 11K task-oriented dialogs (117K utterances) between a user and a virtual assistant in the shopping domain.
arXiv Detail & Related papers (2021-04-18T00:14:29Z)
- Situated and Interactive Multimodal Conversations [21.391260370502224]
We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents.
We provide two SIMMC datasets totalling 13K human-human dialogs (169K utterances), collected using a multimodal Wizard-of-Oz (WoZ) setup.
We present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation.
arXiv Detail & Related papers (2020-06-02T09:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.