A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation
- URL: http://arxiv.org/abs/2012.14781v1
- Date: Tue, 29 Dec 2020 14:47:35 GMT
- Title: A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation
- Authors: Jiangnan Li, Zheng Lin, Peng Fu, Qingyi Si, Weiping Wang
- Abstract summary: Emotion Recognition in Conversation (ERC) is a personalized and interactive emotion recognition task.
The current method models speakers' interactions by building a relation between every pair of speakers.
We simplify this complicated modeling to a binary version: Intra-Speaker and Inter-Speaker dependencies.
- Score: 12.065178204539693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion Recognition in Conversation (ERC) is a more challenging task than
conventional text emotion recognition. It can be regarded as a personalized and
interactive emotion recognition task, which must consider not only the
semantic information of the text but also the influence of the speakers. The
current method models speakers' interactions by building a relation between
every pair of speakers. However, this fine-grained but complicated modeling is
computationally expensive, hard to extend, and can only consider local context.
To address this problem, we simplify the complicated modeling to a binary
version: Intra-Speaker and Inter-Speaker dependencies, without identifying
every unique speaker relative to the targeted speaker. To better realize this
simplified interaction modeling of speakers in the Transformer, which excels
at capturing long-distance dependencies, we design three types of masks and
use them in three independent Transformer blocks. The masks respectively
implement conventional context modeling, Intra-Speaker dependency, and
Inter-Speaker dependency. Furthermore, the speaker-aware information extracted
by the three Transformer blocks contributes differently to the prediction, so
we use an attention mechanism to weight their outputs automatically.
Experiments on two ERC datasets indicate that our model is effective and
achieves better performance.
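As a rough illustration (not the authors' code), the three masks and the attention-based weighting described above could look like the following PyTorch sketch. The names build_masks and ViewAttention, and the convention that True marks an allowed attention pair, are our own assumptions; torch.nn.MultiheadAttention expects the opposite convention (True = blocked), so a mask would be inverted with ~ before being passed in.

```python
# A minimal sketch, assuming per-utterance speaker ids and True = "may attend".
import torch

def build_masks(speaker_ids: torch.Tensor) -> dict:
    """speaker_ids: (n,) integer speaker id of each utterance in the dialogue."""
    n = speaker_ids.size(0)
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)  # (n, n), True where speakers match
    eye = torch.eye(n, dtype=torch.bool)
    return {
        "conv": torch.ones(n, n, dtype=torch.bool),  # conventional context: attend to every utterance
        "intra": same,                               # Intra-Speaker: same speaker's utterances only
        "inter": ~same | eye,                        # Inter-Speaker: other speakers (self kept so no row is empty)
    }

class ViewAttention(torch.nn.Module):
    """Attention over the three speaker-aware views, one weight set per utterance."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = torch.nn.Linear(d_model, 1)

    def forward(self, h_conv, h_intra, h_inter):            # each: (n, d_model)
        H = torch.stack([h_conv, h_intra, h_inter], dim=1)  # (n, 3, d_model)
        w = torch.softmax(self.score(H), dim=1)             # (n, 3, 1) weights over views
        return (w * H).sum(dim=1)                           # (n, d_model) fused representation

spk = torch.tensor([0, 1, 0, 1, 1])  # speakers of five consecutive utterances
masks = build_masks(spk)
# masks["intra"][2] -> [True, False, True, False, False]:
# utterance 2 (speaker 0) may attend only to utterances 0 and 2.
```

In this reading, each mask gates one of the three independent Transformer blocks, and the fused representation then feeds the emotion classifier.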
Related papers
- Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition [27.35304346509647]
We introduce speaker labels into an autoregressive transformer-based speech recognition model.
We then propose a novel speaker mask branch to detect the speech segments of individual speakers.
With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously.
arXiv Detail & Related papers (2023-12-18T06:29:53Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated by experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines on four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z)
- Speaker-Guided Encoder-Decoder Framework for Emotion Recognition in Conversation [23.93696773727978]
The emotion recognition in conversation (ERC) task aims to predict the emotion label of an utterance in a conversation.
We design a novel speaker modeling scheme that explores intra- and inter-speaker dependencies jointly in a dynamic manner.
We also propose a Speaker-Guided Encoder-Decoder (SGED) framework for ERC, which fully exploits speaker information for the decoding of emotion.
arXiv Detail & Related papers (2022-06-07T10:51:47Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- S+PAGE: A Speaker and Position-Aware Graph Neural Network Model for Emotion Recognition in Conversation [12.379143886125926]
Emotion recognition in conversation (ERC) has attracted much attention in recent years for its necessity in widespread applications.
Existing ERC methods mostly model the self- and inter-speaker context separately, which poses a major issue: a lack of sufficient interaction between the two.
We propose a novel Speaker and Position-Aware Graph neural network model for ERC (S+PAGE), which contains three stages to combine the benefits of both the Transformer and the relational graph network.
arXiv Detail & Related papers (2021-12-23T07:25:02Z)
- Multi-View Self-Attention Based Transformer for Speaker Recognition [33.21173007319178]
The Transformer model is widely used for speech processing tasks such as speaker recognition.
We propose a novel multi-view self-attention mechanism for speaker Transformer.
We show that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
arXiv Detail & Related papers (2021-10-11T07:03:23Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)