DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation
- URL: http://arxiv.org/abs/2410.08470v1
- Date: Fri, 11 Oct 2024 02:43:45 GMT
- Title: DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation
- Authors: Jia Li, Yangchen Yu, Yin Chen, Yu Zhang, Peng Jia, Yunbo Xu, Ziqiang Li, Meng Wang, Richang Hong
- Abstract summary: Engagement estimation plays a crucial role in understanding human social behaviors.
We propose a Dialogue-Aware Transformer framework that relies solely on audio-visual input and is language-independent.
Our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
- Score: 42.87704953679693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Engagement estimation plays a crucial role in understanding human social behaviors, attracting increasing research interests in fields such as affective computing and human-computer interaction. In this paper, we propose a Dialogue-Aware Transformer framework (DAT) with Modality-Group Fusion (MGF), which relies solely on audio-visual input and is language-independent, for estimating human engagement in conversations. Specifically, our method employs a modality-group fusion strategy that independently fuses audio and visual features within each modality for each person before inferring the entire audio-visual content. This strategy significantly enhances the model's performance and robustness. Additionally, to better estimate the target participant's engagement levels, the introduced Dialogue-Aware Transformer considers both the participant's behavior and cues from their conversational partners. Our method was rigorously tested in the Multi-Domain Engagement Estimation Challenge held by MultiMediate'24, demonstrating notable improvements in engagement-level regression precision over the baseline model. Notably, our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
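The abstract describes two ideas that are easy to picture in code: per-person modality-group fusion of audio and visual features, and a transformer that attends over both the target participant and a conversational partner. Below is a minimal sketch of that layout, assuming frame-level feature sequences and a single partner; all module names, dimensions, and the fusion design are illustrative assumptions, not the authors' released implementation. A plain implementation of the CCC evaluation metric is included for reference.

```python
# Hypothetical sketch of Modality-Group Fusion (MGF) + a dialogue-aware encoder.
# Shapes, layer sizes, and module names are assumptions for illustration only.
import torch
import torch.nn as nn


class ModalityGroupFusion(nn.Module):
    """Fuse one person's audio and visual features before cross-person modeling."""

    def __init__(self, audio_dim=128, visual_dim=256, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, audio, visual):            # (B, T, audio_dim), (B, T, visual_dim)
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        return self.fuse(torch.cat([a, v], dim=-1))            # (B, T, d_model)


class DialogueAwareEstimator(nn.Module):
    """Attend jointly over the target's and the partner's fused features."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.mgf_target = ModalityGroupFusion(d_model=d_model)
        self.mgf_partner = ModalityGroupFusion(d_model=d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)        # frame-level engagement score

    def forward(self, tgt_audio, tgt_visual, prt_audio, prt_visual):
        tgt = self.mgf_target(tgt_audio, tgt_visual)
        prt = self.mgf_partner(prt_audio, prt_visual)
        joint = self.encoder(torch.cat([tgt, prt], dim=1))     # concatenate along time
        return self.head(joint[:, : tgt.size(1)]).squeeze(-1)  # predictions for target frames


def concordance_ccc(pred, gold):
    """Concordance Correlation Coefficient, the challenge's evaluation metric."""
    p_mean, g_mean = pred.mean(), gold.mean()
    p_var = ((pred - p_mean) ** 2).mean()
    g_var = ((gold - g_mean) ** 2).mean()
    cov = ((pred - p_mean) * (gold - g_mean)).mean()
    return 2 * cov / (p_var + g_var + (p_mean - g_mean) ** 2)
```

Fusing audio and visual streams within each person first, rather than pooling everything at once, is what the abstract credits for the robustness gain; the partner branch is what makes the model dialogue-aware.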
Related papers
- A Framework for Adapting Human-Robot Interaction to Diverse User Groups [16.17512394063696]
We present a novel framework for adaptive Human-Robot Interaction (HRI)
Our primary contributions include the development of an adaptive, ROS-based HRI framework with an open-source code base.
This framework supports natural interactions through advanced speech recognition and voice activity detection.
arXiv Detail & Related papers (2024-10-15T08:16:43Z)
- Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript".
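A toy illustration of what such a "multimodal transcript" could look like: behavioral cues are serialized alongside the words so a text-only LLM can reason over them. The field names, cue set, and prompt wording are assumptions for illustration, not the paper's actual format.

```python
# Hypothetical "multimodal transcript": verbal content plus serialized non-verbal
# cues, formatted as text that an LLM could score for engagement.
utterances = [
    {"speaker": "A", "text": "So how was the trip?",
     "gaze": "toward partner", "smile": 0.8, "pitch": "rising"},
    {"speaker": "B", "text": "It was fine, I guess.",
     "gaze": "averted", "smile": 0.1, "pitch": "flat"},
]

def to_multimodal_transcript(utterances):
    lines = []
    for u in utterances:
        lines.append(
            f'{u["speaker"]}: "{u["text"]}" '
            f'[gaze: {u["gaze"]}; smile: {u["smile"]:.1f}; pitch: {u["pitch"]}]'
        )
    return "\n".join(lines)

prompt = (
    "Rate speaker B's engagement on a scale of 0 to 1, given this annotated dialogue:\n"
    + to_multimodal_transcript(utterances)
)
print(prompt)  # this prompt would then be passed to an LLM of choice
```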
arXiv Detail & Related papers (2024-09-13T18:28:12Z)
- Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition [3.5803801804085347]
We introduce Recursive Joint Cross-Modal Attention (RJCMA) to capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition.
In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities.
Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset.
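A minimal sketch of the stated mechanism: attention weights derived from the cross-correlation between a joint audio-visual-text representation and each individual modality. Dimensions and the projection layout are illustrative assumptions, and the paper's recursive refinement step is omitted.

```python
# Hypothetical sketch: cross-correlation between the joint representation and each
# modality yields the attention weights. Sizes and projections are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointCrossModalAttention(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.joint_proj = nn.Linear(3 * d_model, d_model)

    def forward(self, audio, visual, text):      # each (B, T, d_model)
        joint = self.joint_proj(torch.cat([audio, visual, text], dim=-1))
        attended = []
        for modality in (audio, visual, text):
            # cross-correlation between joint and single-modality time steps
            scores = torch.matmul(joint, modality.transpose(1, 2))      # (B, T, T)
            weights = F.softmax(scores / modality.size(-1) ** 0.5, dim=-1)
            attended.append(torch.matmul(weights, modality))            # (B, T, d_model)
        return attended  # refined audio, visual, and text features
```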
arXiv Detail & Related papers (2024-03-20T15:08:43Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
- DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation [11.185293979235547]
We introduce a convolutional Transformer for modeling and estimating human engagement.
Our proposed system surpasses the baseline models, exhibiting a noteworthy 7% improvement on the test set.
We employ different modality fusion mechanisms and show that, for this type of data, a simple method with self-attention fusion achieves the best performance.
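A minimal sketch of what such a simple self-attention fusion could look like: per-frame audio and visual tokens are stacked and fused by a single attention layer. The layer sizes and per-frame stacking are illustrative assumptions, not the DCTM implementation.

```python
# Hypothetical sketch of simple self-attention fusion over per-frame modality tokens.
import torch
import torch.nn as nn


class SelfAttentionFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio, visual):                 # each (B, T, d_model)
        tokens = torch.stack([audio, visual], dim=2)  # (B, T, 2, d_model)
        b, t, m, d = tokens.shape
        tokens = tokens.reshape(b * t, m, d)          # attend across modalities per frame
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1).reshape(b, t, d)     # (B, T, d_model)
```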
arXiv Detail & Related papers (2023-07-31T06:02:35Z)
- Human-to-Human Interaction Detection [3.00604614803979]
We introduce a new task named human-to-human interaction detection (HID)
HID is devoted to detecting subjects, recognizing person-wise actions, and grouping people according to their interactive relations, all in one model.
First, based on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I)
arXiv Detail & Related papers (2023-07-02T03:24:58Z)
- Multipar-T: Multiparty-Transformer for Capturing Contingent Behaviors in Group Conversations [25.305521223925428]
We propose the Multiparty-Transformer (Multipar-T), a transformer model for multiparty behavior modeling.
The core component of our proposed approach is the Crossperson Attention, which is specifically designed to detect contingent behavior between pairs of people.
We verify the effectiveness of Multipar-T on a publicly available video-based group engagement detection benchmark.
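A minimal sketch of the cross-person attention idea: the target person's features act as queries and a partner's features as keys and values, so the output highlights behavior contingent on the partner. The residual layout and sizes are illustrative assumptions, not Multipar-T's exact design.

```python
# Hypothetical sketch of cross-person attention for contingent-behavior modeling.
import torch.nn as nn


class CrossPersonAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target, partner):               # each (B, T, d_model)
        contingent, _ = self.attn(query=target, key=partner, value=partner)
        return self.norm(target + contingent)         # own behavior + partner-driven part
```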
arXiv Detail & Related papers (2023-04-19T20:23:11Z)
- Evaluating Human-Language Model Interaction [79.33022878034627]
We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems.
We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
We find that better non-interactive performance does not always translate to better human-LM interaction.
arXiv Detail & Related papers (2022-12-19T18:59:45Z)
- Partner Matters! An Empirical Study on Fusing Personas for Personalized Response Selection in Retrieval-Based Chatbots [51.091235903442715]
This paper explores the impact of utilizing personas that describe either the self or the partner speaker on the task of response selection.
Four persona fusion strategies are designed, which assume personas interact with contexts or responses in different ways.
Empirical studies on the Persona-Chat dataset show that the partner personas can improve the accuracy of response selection.
arXiv Detail & Related papers (2021-05-19T10:32:30Z)