Multimodal Conversational AI: A Survey of Datasets and Approaches
- URL: http://arxiv.org/abs/2205.06907v1
- Date: Fri, 13 May 2022 21:51:42 GMT
- Title: Multimodal Conversational AI: A Survey of Datasets and Approaches
- Authors: Anirudh Sundar and Larry Heck
- Abstract summary: A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities.
This paper motivates, defines, and mathematically formulates the multimodal conversational research objective.
- Score: 0.76146285961466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As humans, we experience the world with all our senses or modalities (sound,
sight, touch, smell, and taste). We use these modalities, particularly sight
and touch, to convey and interpret specific meanings. Multimodal expressions
are central to conversations; a rich set of modalities amplify and often
compensate for each other. A multimodal conversational AI system answers
questions, fulfills tasks, and emulates human conversations by understanding
and expressing itself via multiple modalities. This paper motivates, defines,
and mathematically formulates the multimodal conversational research objective.
We provide a taxonomy of research required to solve the objective: multimodal
representation, fusion, alignment, translation, and co-learning. We survey
state-of-the-art datasets and approaches for each research area and highlight
their limiting assumptions. Finally, we identify multimodal co-learning as a
promising direction for multimodal conversational AI research.
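The formulation itself appears in the paper; as a rough sketch only (the notation below is invented here, not the authors'), the objective can be pictured as choosing the response that maximizes conditional likelihood given the multimodal dialogue history:

```latex
% A minimal sketch, NOT the paper's actual formulation: the history H_t
% pairs each utterance u_i with its accompanying modality signals m_i
% (image, audio, gesture, ...), and the system response r_t maximizes
% the model's conditional likelihood.
\[
  r_t^{*} = \operatorname*{arg\,max}_{r} \; P_{\theta}\big(r \mid H_t\big),
  \qquad
  H_t = \{(u_i, m_i)\}_{i=1}^{t}
\]
% Representation, fusion, alignment, translation, and co-learning then
% describe how the pairs (u_i, m_i) are encoded and combined in P_theta.
```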
Related papers
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
The diverse multi-modal masked language modeling is realized by an object divergence constraint upon traditional multi-modal masked language modeling (MLM).
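As a loose illustration of the teacher-student transfer described above (the function and temperature below follow generic distillation conventions, not MMC-Det's actual losses):

```python
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats, temperature=2.0):
    """Soften both feature distributions and penalize their KL divergence,
    nudging the student detector toward the teacher fusion transformer.
    A generic sketch; MMC-Det's objective also involves an object
    divergence constraint on masked language modeling."""
    s = F.log_softmax(student_feats / temperature, dim=-1)
    t = F.softmax(teacher_feats / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```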
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
- Modality Influence in Multimodal Machine Learning [0.0]
The study examines Multimodal Sentiment Analysis, Multimodal Emotion Recognition, Multimodal Hate Speech Recognition, and Multimodal Disease Detection.
The research aims to identify the most influential modality or set of modalities for each task and draw conclusions for diverse multimodal classification tasks.
arXiv Detail & Related papers (2023-06-10T16:28:52Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
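As a minimal sketch of what "embedding information from different modalities" can look like (the architecture and dimensions here are illustrative, not drawn from the survey):

```python
import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    """Project each modality into a shared space, then fuse by
    concatenation. A toy example of multimodal representation
    learning, not a specific surveyed architecture."""
    def __init__(self, text_dim=768, image_dim=2048, hidden=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.fusion = nn.Linear(2 * hidden, hidden)

    def forward(self, text_feats, image_feats):
        t = torch.relu(self.text_proj(text_feats))
        v = torch.relu(self.image_proj(image_feats))
        # Concatenation is the simplest fusion; cross-modal attention is
        # the usual stronger alternative for capturing interactions.
        return self.fusion(torch.cat([t, v], dim=-1))
```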
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- On the Linguistic and Computational Requirements for Creating Face-to-Face Multimodal Human-Machine Interaction [0.0]
We video-recorded thirty-four human-avatar interactions, performed complete linguistic microanalysis on video excerpts, and marked all occurrences of multimodal actions and events.
The data show evidence that double-loop feedback is established during a face-to-face conversation.
We propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the frameworks used to describe human-machine multimodal interactions.
arXiv Detail & Related papers (2022-11-24T21:17:36Z)
- Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities [10.721189858694396]
We study the unification of multilingual and multimodal (MultiX) streams.
We review the languages studied and the gold or silver data with parallel annotations, and examine how these modalities and languages interact in modeling.
We present an account of the modeling approaches, along with their strengths and weaknesses, to better understand in which scenarios they can be used reliably.
arXiv Detail & Related papers (2022-10-30T21:46:01Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, spanning vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions [68.6358773622615]
This paper provides an overview of the computational and theoretical foundations of multimodal machine learning.
We propose a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification.
Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches.
arXiv Detail & Related papers (2022-09-07T19:21:19Z)
- Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z)
- Masking Orchestration: Multi-task Pretraining for Multi-role Dialogue Representation Learning [50.5572111079898]
Multi-role dialogue understanding comprises a wide range of diverse tasks such as question answering, act classification, dialogue summarization etc.
While dialogue corpora are abundantly available, labeled data for specific learning tasks can be highly scarce and expensive.
In this work, we investigate dialogue context representation learning with various types of unsupervised pretraining tasks.
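As one concrete example of such a pretraining task (generic BERT-style masking, not the paper's specific orchestration of tasks):

```python
import random
import torch

def mask_dialogue_tokens(token_ids, mask_id, prob=0.15):
    """Randomly mask tokens in a dialogue and return (inputs, labels);
    label -100 is the conventional ignore index for the LM loss.
    A generic sketch of masked-token pretraining."""
    input_ids = token_ids.clone()
    labels = torch.full_like(token_ids, -100)
    for i in range(token_ids.size(0)):
        if random.random() < prob:
            labels[i] = token_ids[i]   # target: the original token
            input_ids[i] = mask_id     # input: the [MASK] placeholder
    return input_ids, labels
```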
arXiv Detail & Related papers (2020-02-27T04:36:52Z)
- Detecting depression in dyadic conversations with multimodal narratives and visualizations [1.4824891788575418]
In this paper, we develop a system that supports humans in analyzing conversations.
We demonstrate the ability of our system to take in a wide range of multimodal information and automatically generate a prediction score for the depression state of the individual.
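A toy sketch of mapping heterogeneous conversational features to a single prediction score (the feature count and layer sizes are invented for illustration; this is not the paper's system):

```python
import torch.nn as nn

class DepressionScorer(nn.Module):
    """Map concatenated per-conversation features (e.g., acoustic,
    lexical, turn-taking statistics) to a scalar score in [0, 1].
    A hypothetical illustration only."""
    def __init__(self, n_features=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.Sigmoid(),  # squash to a probability-like score
        )

    def forward(self, features):
        return self.net(features)
```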
arXiv Detail & Related papers (2020-01-13T10:47:13Z)