On the Linguistic and Computational Requirements for Creating
Face-to-Face Multimodal Human-Machine Interaction
- URL: http://arxiv.org/abs/2211.13804v1
- Date: Thu, 24 Nov 2022 21:17:36 GMT
- Title: On the Linguistic and Computational Requirements for Creating
Face-to-Face Multimodal Human-Machine Interaction
- Authors: João Ranhel and Cacilda Vilela de Lima
- Abstract summary: We videorecorded thirty-four human-avatar interactions, performed complete linguistic microanalysis on video excerpts, and marked all the occurrences of multimodal actions and events.
The data show evidence that double-loop feedback is established during a face-to-face conversation.
We propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the frameworks used for describing human-machine multimodal interactions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, conversations between humans and avatars are linguistically,
organizationally, and structurally analyzed, focusing on what is necessary for
creating face-to-face multimodal interfaces for machines. We videorecorded
thirty-four human-avatar interactions, performed complete linguistic
microanalysis on video excerpts, and marked all the occurrences of multimodal
actions and events. Statistical inferences were applied to the data, allowing us to
comprehend not only how often multimodal actions occur but also how multimodal
events are distributed between the speaker (emitter) and the listener
(recipient). We also observed the distribution of multimodal occurrences for
each modality. The data show evidence that double-loop feedback is established
during a face-to-face conversation. This led us to propose that knowledge from
Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among
others, should be incorporated into the frameworks used for describing
human-machine multimodal interactions. Face-to-face interfaces require a
control layer in addition to the multimodal fusion layer. This layer has to
organize the flow of the conversation, integrate the social context into the
interaction, and make plans concerning 'what' and 'how' to progress the
interaction. This
higher level is best understood if we incorporate insights from CA and ToM into
the interface system.
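
To make the proposed organization concrete, the following is a minimal sketch (in Python) of an interface in which a multimodal fusion layer feeds a higher-level conversation-control layer. It is an illustration only, not the authors' implementation: all names (ModalEvent, FusionLayer, ConversationControlLayer, plan_next_move) are hypothetical, and the simple rules stand in for the CA- and ToM-informed logic the paper argues this layer needs.

from dataclasses import dataclass

@dataclass
class ModalEvent:
    modality: str   # e.g. "speech", "gaze", "gesture", "facial"
    source: str     # "emitter" (speaker) or "recipient" (listener)
    payload: str    # simplified content of the event

@dataclass
class FusedObservation:
    events: list          # modal events co-occurring in one time window
    floor_holder: str     # who currently holds the conversational floor

class FusionLayer:
    # Merges per-modality events into a single fused observation.
    def fuse(self, events):
        speakers = [e.source for e in events if e.modality == "speech"]
        floor = speakers[0] if speakers else "none"
        return FusedObservation(events=events, floor_holder=floor)

class ConversationControlLayer:
    # Stand-in for the higher-level layer: organizes the flow of the
    # conversation, tracks the partner (ToM), and plans 'what' and 'how'.
    def __init__(self):
        self.partner_model = {"is_attending": True}   # crude ToM placeholder

    def update_partner_model(self, obs):
        # Recipient backchannels (gaze, nods) close the second feedback loop.
        self.partner_model["is_attending"] = any(
            e.source == "recipient" for e in obs.events)

    def plan_next_move(self, obs):
        self.update_partner_model(obs)
        if obs.floor_holder == "recipient":
            what, how = "listen", ["gaze", "nod"]          # yield the turn
        elif not self.partner_model["is_attending"]:
            what, how = "repair", ["speech", "gesture"]    # re-engage partner
        else:
            what, how = "continue", ["speech", "facial"]
        return {"what": what, "how": how}

# Example window: the avatar is speaking while the user nods.
fusion, control = FusionLayer(), ConversationControlLayer()
window = [ModalEvent("speech", "emitter", "explaining"),
          ModalEvent("gesture", "recipient", "nod")]
print(control.plan_next_move(fusion.fuse(window)))   # -> continue via speech/facial

In this toy version, the double-loop feedback reported in the study shows up as two signals the control layer keeps updating: who holds the floor and whether the recipient's backchannel events indicate attention.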
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations [39.79734528362605]
Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z)
- Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction [2.1261712640167856]
Emotion recognition is a crucial task for human conversation understanding.
We propose a Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT).
CORECT effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies.
arXiv Detail & Related papers (2023-11-08T07:46:25Z)
- Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- InterMulti: Multi-view Multimodal Interactions with Text-dominated Hierarchical High-order Fusion for Emotion Analysis [10.048903012988882]
We propose a multimodal emotion analysis framework, InterMulti, to capture complex multimodal interactions from different views.
Our proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations.
The Text-dominated Hierarchical High-order Fusion (THHF) module integrates the above three kinds of representations into a comprehensive multimodal interaction representation.
arXiv Detail & Related papers (2022-12-20T07:02:32Z)
- Face-to-Face Contrastive Learning for Social Intelligence Question-Answering [55.90243361923828]
Multimodal methods have set the state of the art on many tasks but have difficulty modeling complex face-to-face conversational dynamics.
We propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions.
We experimentally evaluate on the challenging Social-IQ dataset and show state-of-the-art results.
arXiv Detail & Related papers (2022-07-29T20:39:44Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, meetings composed of 3-4 persons, and microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- Multimodal Conversational AI: A Survey of Datasets and Approaches [0.76146285961466]
A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities.
This paper motivates, defines, and mathematically formulates the multimodal conversational research objective.
arXiv Detail & Related papers (2022-05-13T21:51:42Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations [5.5997926295092295]
Multimodal Emotion Recognition in Conversations (ERC) has considerable prospects for developing empathetic machines.
Recent graph-based fusion methods aggregate multimodal information by exploring unimodal and cross-modal interactions in a graph.
We propose a novel Multimodal Dynamic Fusion Network (MM-DFN) to recognize emotions by fully understanding multimodal conversational context.
arXiv Detail & Related papers (2022-03-04T15:42:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.