DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement
Estimation in Conversation
- URL: http://arxiv.org/abs/2308.01966v1
- Date: Mon, 31 Jul 2023 06:02:35 GMT
- Title: DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement
Estimation in Conversation
- Authors: Vu Ngoc Tu, Van Thong Huynh, Hyung-Jeong Yang, M. Zaigham Zaheer, Shah
Nawaz, Karthik Nandakumar, Soo-Hyung Kim
- Abstract summary: We introduce a dilated convolutional Transformer for modeling and estimating human engagement.
Our proposed system surpasses the baseline models, exhibiting a noteworthy 7% improvement on the test set.
We employ different modality fusion mechanisms and show that, for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.
- Score: 11.185293979235547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational engagement estimation is posed as a regression problem,
entailing the identification of the attention and involvement of the
participants in the conversation. This task is a crucial pursuit for gaining
insights into human interaction dynamics and behavior patterns within a
conversation. In this research, we introduce a dilated convolutional
Transformer for modeling and estimating human engagement in the MULTIMEDIATE
2023 competition. Our proposed system surpasses the baseline models, exhibiting
a noteworthy 7% improvement on the test set and 4% on the validation set.
Moreover, we employ different modality fusion mechanisms and show that, for this
type of data, a simple concatenation method with self-attention fusion achieves
the best performance.
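As a rough illustration of the pipeline the abstract describes, the sketch below combines concatenated per-modality features, self-attention fusion, dilated temporal convolutions, and a Transformer encoder to regress a per-frame engagement score. The feature dimensions, layer sizes, and layer ordering are illustrative assumptions, not the authors' released architecture.

```python
# Minimal PyTorch sketch of the idea described above (not the authors' code):
# per-modality features are concatenated, fused with self-attention, and passed
# through dilated 1D convolutions and a Transformer encoder to regress engagement.
import torch
import torch.nn as nn

class DCTMSketch(nn.Module):
    def __init__(self, dims=(128, 64), d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project the concatenated modality features (e.g. visual + audio) to d_model.
        self.proj = nn.Linear(sum(dims), d_model)
        # Self-attention fusion over the concatenated (projected) sequence.
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Dilated temporal convolutions widen the receptive field over frames.
        self.dilated = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # per-frame engagement score

    def forward(self, visual, audio):
        # visual: (B, T, dims[0]), audio: (B, T, dims[1])
        x = self.proj(torch.cat([visual, audio], dim=-1))    # (B, T, d_model)
        x, _ = self.fusion(x, x, x)                           # self-attention fusion
        x = self.dilated(x.transpose(1, 2)).transpose(1, 2)   # dilated temporal conv
        x = self.encoder(x)                                   # Transformer encoding
        return self.head(x).squeeze(-1)                       # (B, T)

scores = DCTMSketch()(torch.randn(2, 100, 128), torch.randn(2, 100, 64))
```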
Related papers
- MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
- DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation [42.87704953679693]
Engagement estimation plays a crucial role in understanding human social behaviors.
We propose a Dialogue-Aware Transformer framework that relies solely on audio-visual input and is language-independent.
Our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
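The CCC scores quoted here refer to the concordance correlation coefficient, the standard agreement metric in these engagement estimation challenges. A minimal NumPy sketch of the usual formula (not the challenge's official evaluation script) follows.

```python
# Concordance correlation coefficient: agreement between predictions and labels.
import numpy as np

def ccc(pred, gold):
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (vp + vg + (mp - mg) ** 2)

print(ccc([0.1, 0.5, 0.9], [0.2, 0.4, 1.0]))  # values near 1 mean strong agreement
```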
arXiv Detail & Related papers (2024-10-11T02:43:45Z)
- Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z)
- Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition [3.5803801804085347]
We introduce Recursive Joint Cross-Modal Attention (RJCMA) to capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition.
In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities.
Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset.
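A much-simplified sketch of the cross-correlation-based attention idea described above, with assumed tensor shapes and a fixed number of recursion steps (not the authors' implementation):

```python
# Joint representation attends to each modality via cross-correlation weights,
# and the attended features are fed back to refine the joint representation.
import torch
import torch.nn.functional as F

def joint_cross_modal_attention(audio, visual, text, steps=2):
    # each modality: (B, T, D); the joint representation starts as their sum
    joint = audio + visual + text
    for _ in range(steps):
        attended = []
        for m in (audio, visual, text):
            # correlation between joint and modality features over time steps
            corr = torch.matmul(joint, m.transpose(1, 2)) / m.shape[-1] ** 0.5
            attended.append(torch.matmul(F.softmax(corr, dim=-1), m))
        joint = sum(attended)  # recursive refinement of the joint representation
    return joint

out = joint_cross_modal_attention(*[torch.randn(2, 50, 64) for _ in range(3)])
```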
arXiv Detail & Related papers (2024-03-20T15:08:43Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
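For intuition, a task-vector-style sketch of merging several fine-tuned models with per-layer coefficients is shown below; the procedure AdaMerging uses to learn those coefficients without the original training data is omitted, so the names and values here are illustrative assumptions rather than the paper's method.

```python
# Merge fine-tuned models into a pretrained backbone using per-layer coefficients.
import torch

def merge_state_dicts(pretrained, finetuned_models, lambdas):
    # pretrained / finetuned_models[k]: state dicts; lambdas[k][name]: per-layer scalar
    merged = {}
    for name, base in pretrained.items():
        task_vectors = [ft[name] - base for ft in finetuned_models]
        merged[name] = base + sum(lambdas[k][name] * tv
                                  for k, tv in enumerate(task_vectors))
    return merged

base = {"w": torch.zeros(2)}
fts = [{"w": torch.tensor([1.0, 0.0])}, {"w": torch.tensor([0.0, 1.0])}]
lams = [{"w": 0.3}, {"w": 0.5}]
print(merge_state_dicts(base, fts, lams)["w"])  # tensor([0.3000, 0.5000])
```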
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- Joint-Relation Transformer for Multi-Person Motion Prediction [79.08243886832601]
We propose the Joint-Relation Transformer to enhance interaction modeling.
Our method achieves a 13.4% improvement of 900ms VIM on 3DPW-SoMoF/RC and 17.8%/12.0% improvement of 3s MPJPE.
arXiv Detail & Related papers (2023-08-09T09:02:47Z)
- Emotional Reaction Intensity Estimation Based on Multimodal Data [24.353102762289545]
This paper introduces our method for the Emotional Reaction Intensity (ERI) Estimation Challenge.
Based on the multimodal data provided by the organizers, we extract acoustic and visual features with different pretrained models.
arXiv Detail & Related papers (2023-03-16T09:14:47Z)
- A Probabilistic Model Of Interaction Dynamics for Dyadic Face-to-Face Settings [1.9544213396776275]
We develop a probabilistic model to capture the interaction dynamics between pairs of participants in a face-to-face setting.
This interaction encoding is then used to influence the generation when predicting one agent's future dynamics.
We show that our model successfully delineates between the modes, based on their interacting dynamics.
arXiv Detail & Related papers (2022-07-10T23:31:27Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Hybrid Supervised Reinforced Model for Dialogue Systems [2.1485350418225244]
The model copes with both tasks required for Dialogue Management: State Tracking and Decision Making.
The model achieves better performance, faster learning, and greater robustness than a non-recurrent baseline.
arXiv Detail & Related papers (2020-11-04T12:03:12Z)