A Dual-Stream Recurrence-Attention Network With Global-Local Awareness
for Emotion Recognition in Textual Dialog
- URL: http://arxiv.org/abs/2307.00449v2
- Date: Wed, 22 Nov 2023 15:43:23 GMT
- Authors: Jiang Li, Xiaoping Wang, Zhigang Zeng
- Abstract summary: We propose a simple and effective Dual-stream Recurrence-Attention Network (DualRAN)
DualRAN eschews the complex components of current methods and focuses on combining recurrence-based methods with attention-based ones.
We show that DualRAN boosts the weighted F1 scores by 1.43% and 0.64% on the IEMOCAP and MELD datasets, respectively.
- Score: 41.72374101704424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In real-world dialog systems, the ability to understand the user's emotions
and interact anthropomorphically is of great significance. Emotion Recognition
in Conversation (ERC) is one of the key ways to accomplish this goal and has
attracted growing attention. How to model the context in a conversation is a
central aspect and a major challenge of ERC tasks. Most existing approaches
struggle to adequately incorporate both global and local contextual
information, and their network structures are overly complex. For this
reason, we propose a simple and effective Dual-stream Recurrence-Attention
Network (DualRAN), which is based on Recurrent Neural Network (RNN) and
Multi-head ATtention network (MAT). DualRAN eschews the complex components of
current methods and focuses on combining recurrence-based methods with
attention-based ones. DualRAN is a dual-stream structure mainly consisting of
local- and global-aware modules, modeling a conversation simultaneously from
distinct perspectives. In addition, we develop two single-stream network
variants for DualRAN, i.e., SingleRANv1 and SingleRANv2. According to the
experimental findings, DualRAN boosts the weighted F1 scores by 1.43% and 0.64%
on the IEMOCAP and MELD datasets, respectively, in comparison to the strongest
baseline. On two other datasets (i.e., EmoryNLP and DailyDialog), our method
also attains competitive results.
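As a concrete illustration of the dual-stream idea, the following is a minimal PyTorch sketch of a global-local block in the spirit of DualRAN: a bidirectional GRU serves as the local-aware (recurrence) stream, multi-head self-attention as the global-aware stream, and the two views are fused for per-utterance emotion classification. The module choices, sizes, and concatenation-based fusion are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a dual-stream recurrence-attention block (assumed design,
# not the authors' released code). Input: a conversation as a sequence of
# utterance embeddings with shape (batch, num_utterances, dim).
import torch
import torch.nn as nn

class DualRANSketch(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=6):
        super().__init__()
        # Local-aware stream: a bidirectional GRU captures nearby context.
        self.local_rnn = nn.GRU(dim, dim // 2, batch_first=True,
                                bidirectional=True)
        # Global-aware stream: multi-head self-attention sees the whole dialog.
        self.global_attn = nn.MultiheadAttention(dim, num_heads,
                                                 batch_first=True)
        # Fuse the two views and classify each utterance's emotion.
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, utterances):
        local, _ = self.local_rnn(utterances)           # (B, T, dim)
        global_, _ = self.global_attn(utterances, utterances, utterances)
        fused = torch.cat([local, global_], dim=-1)     # (B, T, 2*dim)
        return self.classifier(fused)                   # per-utterance logits

# Usage: 2 dialogs, 10 utterances each, 256-dim utterance features.
logits = DualRANSketch()(torch.randn(2, 10, 256))
print(logits.shape)  # torch.Size([2, 10, 6])
```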
Related papers
- BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning [26.400567961735234]
Correspondence pruning aims to establish reliable correspondences between two related images.
Existing approaches often employ a progressive strategy to handle the local and global contexts.
We propose a parallel context learning strategy that involves acquiring bilateral consensus for the two-view correspondence pruning task.
arXiv Detail & Related papers (2024-01-07T11:38:15Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition [41.837538440839815]
We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition.
The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data represented using a bidirectional encoder representations from transformers (BERT) model.
In order to incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer.
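As a rough sketch of the co-attention idea mentioned above, the block below lets each modality attend to the other with standard multi-head cross-attention; the symmetric two-way design and all shapes are assumptions for illustration, not the HCAM authors' code.

```python
# Minimal co-attention sketch for fusing audio and text sequences
# (an assumed, generic pattern; not the HCAM authors' implementation).
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.audio_queries_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_queries_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, text):
        # Each modality queries the other; the two contextualized views
        # can then be concatenated or pooled downstream.
        a_ctx, _ = self.audio_queries_text(audio, text, text)   # audio attends to text
        t_ctx, _ = self.text_queries_audio(text, audio, audio)  # text attends to audio
        return a_ctx, t_ctx

audio = torch.randn(2, 50, 256)  # e.g., wav2vec-style frame features
text = torch.randn(2, 20, 256)   # e.g., BERT token embeddings
a_ctx, t_ctx = CoAttention()(audio, text)
```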
arXiv Detail & Related papers (2023-04-14T03:25:00Z)
- Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis [47.29528724322795]
Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.
Despite significant progress, there are still two major challenges on the way towards robust MSA.
We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR).
arXiv Detail & Related papers (2022-08-16T08:02:30Z)
- Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model named as Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
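The contrast drawn above, attention-aligned representations versus the last hidden state of an LSTM, can be sketched as follows; the scoring layer and pooling are a generic pattern assumed for illustration, not taken from GBAN.

```python
# Sketch contrasting last-hidden-state pooling with attention pooling over
# LSTM hidden states (a generic pattern; GBAN's exact design is assumed).
import torch
import torch.nn as nn

class AttnPooledLSTM(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)  # learned attention score per time step

    def forward(self, x):
        h, (h_n, _) = self.lstm(x)               # h: (B, T, dim)
        last = h_n[-1]                           # last-hidden-state baseline
        w = torch.softmax(self.score(h), dim=1)  # (B, T, 1) attention weights
        pooled = (w * h).sum(dim=1)              # attention-pooled summary
        return last, pooled

last, pooled = AttnPooledLSTM()(torch.randn(2, 40, 128))
```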
arXiv Detail & Related papers (2022-01-17T09:46:59Z)
- A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition [7.80238628278552]
We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition.
To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset.
The experimental results show that the proposed CFN-SR achieves state-of-the-art performance, obtaining 75.76% accuracy with 26.30M parameters.
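A minimal sketch of the self-attention-plus-residual structure named in this abstract is shown below; the layer norm and block layout are generic assumptions, not CFN-SR's actual design.

```python
# Generic self-attention block with a residual connection, the kind of
# structure the CFN-SR abstract alludes to (exact design assumed).
import torch
import torch.nn as nn

class ResidualSelfAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # The residual path keeps the original features alongside attention.
        attended, _ = self.attn(x, x, x)
        return self.norm(x + attended)

y = ResidualSelfAttention()(torch.randn(2, 30, 256))
```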
arXiv Detail & Related papers (2021-11-03T12:24:03Z)
- PIN: A Novel Parallel Interactive Network for Spoken Language Understanding [68.53121591998483]
In existing RNN-based approaches, intent detection (ID) and slot filling (SF) tasks are often jointly modeled to exploit the correlation between them.
The experiments on two benchmark datasets, i.e., SNIPS and ATIS, demonstrate the effectiveness of our approach.
More encouragingly, by using the feature embedding of the utterance generated by the pre-trained language model BERT, our method achieves the state-of-the-art among all comparison approaches.
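A minimal sketch of jointly modeling ID and SF with a shared RNN encoder is given below; the shared-encoder layout and all sizes are assumptions for illustration, and PIN's parallel interaction mechanism is not reproduced here.

```python
# Minimal shared-encoder sketch of joint intent detection (ID) and slot
# filling (SF); PIN's actual parallel-interaction design is assumed.
import torch
import torch.nn as nn

class JointIDSF(nn.Module):
    def __init__(self, vocab=1000, dim=128, num_intents=7, num_slots=72):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * dim, num_intents)  # utterance-level
        self.slot_head = nn.Linear(2 * dim, num_slots)      # token-level

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))   # (B, T, 2*dim)
        intent = self.intent_head(h.mean(dim=1))  # pooled for one intent label
        slots = self.slot_head(h)                 # one slot tag per token
        return intent, slots

intent, slots = JointIDSF()(torch.randint(0, 1000, (2, 12)))
```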
arXiv Detail & Related papers (2020-09-28T15:59:31Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for multi-stage, coarse-to-fine human-object interaction (HOI) understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
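The multi-stage refinement loop described above can be sketched schematically as follows; the per-stage interfaces are placeholders assumed for illustration, not the paper's actual localization and recognition networks.

```python
# Schematic cascade loop: each stage refines proposal features and then
# scores interactions (structure inferred from the abstract; all
# interfaces here are assumptions).
import torch
import torch.nn as nn

class CascadeHOISketch(nn.Module):
    def __init__(self, dim=256, num_stages=3, num_actions=29):
        super().__init__()
        self.localizers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_stages))
        self.recognizers = nn.ModuleList(nn.Linear(dim, num_actions) for _ in range(num_stages))

    def forward(self, proposals):
        scores = []
        for loc, rec in zip(self.localizers, self.recognizers):
            proposals = torch.relu(loc(proposals))  # refine proposal features
            scores.append(rec(proposals))           # score interactions per stage
        return proposals, scores

refined, per_stage_scores = CascadeHOISketch()(torch.randn(2, 100, 256))
```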
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.