DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement
Estimation in Conversation
- URL: http://arxiv.org/abs/2308.01966v1
- Date: Mon, 31 Jul 2023 06:02:35 GMT
- Title: DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement
Estimation in Conversation
- Authors: Vu Ngoc Tu, Van Thong Huynh, Hyung-Jeong Yang, M. Zaigham Zaheer, Shah
Nawaz, Karthik Nandakumar, Soo-Hyung Kim
- Abstract summary: We introduce a dilated convolutional Transformer for modeling and estimating human engagement.
Our proposed system surpasses the baseline models, exhibiting a noteworthy 7% improvement on the test set.
We employ different modality fusion mechanisms and show that, for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.
- Score: 11.185293979235547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conversational engagement estimation is posed as a regression problem,
entailing the identification of the attention and involvement of the
participants in the conversation. This task is a crucial pursuit for gaining
insights into human interaction dynamics and behavior patterns within a
conversation. In this research, we introduce a dilated convolutional
Transformer for modeling and estimating human engagement in the MULTIMEDIATE
2023 competition. Our proposed system surpasses the baseline models, exhibiting
a noteworthy 7% improvement on the test set and 4% on the validation set.
Moreover, we employ different modality fusion mechanisms and show that, for this
type of data, a simple concatenation method with self-attention fusion achieves
the best performance.
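As a rough illustration of the pipeline the abstract describes, the sketch below combines concatenated per-modality features, self-attention fusion, dilated temporal convolutions, and a Transformer encoder to regress a per-frame engagement score. The feature dimensions, layer sizes, and layer ordering are illustrative assumptions, not the authors' released architecture.

```python
# Minimal PyTorch sketch of the idea described above (not the authors' code):
# per-modality features are concatenated, fused with self-attention, and passed
# through dilated 1D convolutions and a Transformer encoder to regress engagement.
import torch
import torch.nn as nn

class DCTMSketch(nn.Module):
    def __init__(self, dims=(128, 64), d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project the concatenated modality features (e.g. visual + audio) to d_model.
        self.proj = nn.Linear(sum(dims), d_model)
        # Self-attention fusion over the concatenated (projected) sequence.
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Dilated temporal convolutions widen the receptive field over frames.
        self.dilated = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # per-frame engagement score

    def forward(self, visual, audio):
        # visual: (B, T, dims[0]), audio: (B, T, dims[1])
        x = self.proj(torch.cat([visual, audio], dim=-1))    # (B, T, d_model)
        x, _ = self.fusion(x, x, x)                           # self-attention fusion
        x = self.dilated(x.transpose(1, 2)).transpose(1, 2)   # dilated temporal conv
        x = self.encoder(x)                                   # Transformer encoding
        return self.head(x).squeeze(-1)                       # (B, T)

scores = DCTMSketch()(torch.randn(2, 100, 128), torch.randn(2, 100, 64))
```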
Related papers
- MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
- DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation [42.87704953679693]
Engagement estimation plays a crucial role in understanding human social behaviors.
We propose a Dialogue-Aware Transformer framework that relies solely on audio-visual input and is language-independent.
Our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
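The CCC scores quoted here refer to the concordance correlation coefficient, the standard agreement metric in these engagement estimation challenges. A minimal NumPy sketch of the usual formula (not the challenge's official evaluation script) follows.

```python
# Concordance correlation coefficient: agreement between predictions and labels.
import numpy as np

def ccc(pred, gold):
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (vp + vg + (mp - mg) ** 2)

print(ccc([0.1, 0.5, 0.9], [0.2, 0.4, 1.0]))  # values near 1 mean strong agreement
```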
arXiv Detail & Related papers (2024-10-11T02:43:45Z)
- Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z)
- Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition [3.5803801804085347]
We introduce Recursive Joint Cross-Modal Attention (RJCMA) to capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition.
In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities.
Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset.
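A much-simplified sketch of the cross-correlation-based attention idea described above, with assumed tensor shapes and a fixed number of recursion steps (not the authors' implementation):

```python
# Joint representation attends to each modality via cross-correlation weights,
# and the attended features are fed back to refine the joint representation.
import torch
import torch.nn.functional as F

def joint_cross_modal_attention(audio, visual, text, steps=2):
    # each modality: (B, T, D); the joint representation starts as their sum
    joint = audio + visual + text
    for _ in range(steps):
        attended = []
        for m in (audio, visual, text):
            # correlation between joint and modality features over time steps
            corr = torch.matmul(joint, m.transpose(1, 2)) / m.shape[-1] ** 0.5
            attended.append(torch.matmul(F.softmax(corr, dim=-1), m))
        joint = sum(attended)  # recursive refinement of the joint representation
    return joint

out = joint_cross_modal_attention(*[torch.randn(2, 50, 64) for _ in range(3)])
```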
arXiv Detail & Related papers (2024-03-20T15:08:43Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
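For intuition, a task-vector-style sketch of merging several fine-tuned models with per-layer coefficients is shown below; the procedure AdaMerging uses to learn those coefficients without the original training data is omitted, so the names and values here are illustrative assumptions rather than the paper's method.

```python
# Merge fine-tuned models into a pretrained backbone using per-layer coefficients.
import torch

def merge_state_dicts(pretrained, finetuned_models, lambdas):
    # pretrained / finetuned_models[k]: state dicts; lambdas[k][name]: per-layer scalar
    merged = {}
    for name, base in pretrained.items():
        task_vectors = [ft[name] - base for ft in finetuned_models]
        merged[name] = base + sum(lambdas[k][name] * tv
                                  for k, tv in enumerate(task_vectors))
    return merged

base = {"w": torch.zeros(2)}
fts = [{"w": torch.tensor([1.0, 0.0])}, {"w": torch.tensor([0.0, 1.0])}]
lams = [{"w": 0.3}, {"w": 0.5}]
print(merge_state_dicts(base, fts, lams)["w"])  # tensor([0.3000, 0.5000])
```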
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- Joint-Relation Transformer for Multi-Person Motion Prediction [79.08243886832601]
We propose the Joint-Relation Transformer to enhance interaction modeling.
Our method achieves a 13.4% improvement of 900ms VIM on 3DPW-SoMoF/RC and 17.8%/12.0% improvement of 3s MPJPE.
arXiv Detail & Related papers (2023-08-09T09:02:47Z)
- Emotional Reaction Intensity Estimation Based on Multimodal Data [24.353102762289545]
This paper introduces our method for the Emotional Reaction Intensity (ERI) Estimation Challenge.
Based on the multimodal data provided by the organizers, we extract acoustic and visual features with different pretrained models.
arXiv Detail & Related papers (2023-03-16T09:14:47Z)
- A Probabilistic Model Of Interaction Dynamics for Dyadic Face-to-Face Settings [1.9544213396776275]
We develop a probabilistic model to capture the interaction dynamics between pairs of participants in a face-to-face setting.
This interaction encoding is then used to influence the generation when predicting one agent's future dynamics.
We show that our model successfully delineates between the modes, based on their interacting dynamics.
arXiv Detail & Related papers (2022-07-10T23:31:27Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Hybrid Supervised Reinforced Model for Dialogue Systems [2.1485350418225244]
The model copes with both tasks required for Dialogue Management: State Tracking and Decision Making.
The model achieves better performance, faster learning, and greater robustness than a non-recurrent baseline.
arXiv Detail & Related papers (2020-11-04T12:03:12Z)