Emotion recognition by fusing time synchronous and time asynchronous
representations
- URL: http://arxiv.org/abs/2010.14102v2
- Date: Thu, 22 Jul 2021 11:46:06 GMT
- Title: Emotion recognition by fusing time synchronous and time asynchronous
representations
- Authors: Wen Wu, Chao Zhang, Philip C. Woodland
- Abstract summary: A novel two-branch neural network model structure is proposed for multimodal emotion recognition.
It consists of a time synchronous branch (TSB) and a time asynchronous branch (TAB).
The two-branch structure achieves state-of-the-art results in 4-way classification with all common test setups.
- Score: 17.26466867595571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, a novel two-branch neural network model structure is proposed
for multimodal emotion recognition, which consists of a time synchronous branch
(TSB) and a time asynchronous branch (TAB). To capture correlations between
each word and its acoustic realisation, the TSB combines speech and text
modalities at each input window frame and then does pooling across time to form
a single embedding vector. The TAB, by contrast, provides cross-utterance
information by integrating sentence text embeddings from a number of context
utterances into another embedding vector. The final emotion classification uses
both the TSB and the TAB embeddings. Experimental results on the IEMOCAP
dataset demonstrate that the two-branch structure achieves state-of-the-art
results in 4-way classification with all common test setups. When using
automatic speech recognition (ASR) output instead of manually transcribed
reference text, it is shown that the cross-utterance information considerably
improves the robustness against ASR errors. Furthermore, by incorporating an
extra class for all the other emotions, the final 5-way classification system
with ASR hypotheses can be viewed as a prototype for more realistic emotion
recognition systems.
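To make the branch layout concrete, here is a minimal PyTorch sketch of the two-branch idea: the TSB fuses speech frames with per-frame word features and pools across time, the TAB summarises sentence embeddings from context utterances, and the two embeddings are concatenated for classification. The layer types, dimensions, and mean-pooling choice are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoBranchEmotionClassifier(nn.Module):
    """Hedged sketch of a TSB/TAB-style two-branch classifier (assumed sizes)."""

    def __init__(self, acoustic_dim=80, word_dim=300, sent_dim=768,
                 hidden_dim=128, num_classes=4):
        super().__init__()
        # TSB: combine speech and text at each input frame, then pool across time.
        self.tsb_rnn = nn.GRU(acoustic_dim + word_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # TAB: integrate sentence embeddings from a number of context utterances.
        self.tab_rnn = nn.GRU(sent_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, frames, frame_words, context_sents):
        # frames: (B, T, acoustic_dim); frame_words: (B, T, word_dim), each word
        # embedding repeated over the frames of its acoustic realisation.
        # context_sents: (B, N, sent_dim) for N context utterances.
        tsb_out, _ = self.tsb_rnn(torch.cat([frames, frame_words], dim=-1))
        tsb_emb = tsb_out.mean(dim=1)            # pooling across time
        tab_out, _ = self.tab_rnn(context_sents)
        tab_emb = tab_out.mean(dim=1)            # one vector for the context
        # Final emotion classification uses both the TSB and TAB embeddings.
        return self.classifier(torch.cat([tsb_emb, tab_emb], dim=-1))
```

With num_classes=5, the same skeleton covers the extra class for all other emotions mentioned in the abstract.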
Related papers
- Timestamped Embedding-Matching Acoustic-to-Word CTC ASR [2.842794675894731]
We describe a novel method of training an embedding-matching word-level connectionist temporal classification (CTC) automatic speech recognizer (ASR).
The word timestamps enable the ASR to output word segmentations and word confusion networks without relying on a secondary model or a forced alignment process at test time.
arXiv Detail & Related papers (2023-06-20T11:53:43Z)
- Speech-text based multi-modal training with bidirectional attention for
improved speech recognition [26.47071418582507]
We propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn an ASR encoder (bottom layers) and a text encoder with a multi-modal learning method.
BiAM facilitates feature sampling rate exchange, so that the quality of the features transformed from one modality can be measured in the space of the other.
Experimental results on the Librispeech corpus show up to a 6.15% word error rate reduction (WERR) when learning from paired data only, and a 9.23% WERR when additional unpaired text data is employed.
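As a rough illustration of bidirectional cross-modal attention, the sketch below lets each modality attend to the other, so speech features are re-expressed at the text token rate and text features at the speech frame rate; the use of standard multi-head attention and all dimensions here are assumptions, not the paper's exact BiAM.

```python
import torch
import torch.nn as nn

d_model = 256  # assumed shared feature size after projection
speech_to_text = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
text_to_speech = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

speech = torch.randn(2, 100, d_model)  # (batch, speech frames, d_model)
text = torch.randn(2, 20, d_model)     # (batch, text tokens, d_model)

# Each modality queries the other, which exchanges sampling rates: the first
# output carries text information at the 100-frame speech rate, the second
# carries speech information at the 20-token text rate.
text_at_speech_rate, _ = speech_to_text(speech, text, text)
speech_at_text_rate, _ = text_to_speech(text, speech, speech)
```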
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
- SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for
Task-Oriented Dialog Understanding [68.94808536012371]
We propose a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora.
Our method achieves new state-of-the-art results on the DialoGLUE benchmark, which consists of seven datasets and four popular dialog understanding tasks.
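The summary does not spell out the objective, so the following is only a generic InfoNCE-style contrastive loss of the kind such pre-training typically builds on; SPACE-2's tree-structured, semi-supervised formulation goes beyond this sketch.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    # anchors, positives: (B, D) paired views/representations of the same dialogs.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0))     # diagonal pairs are the positives
    return F.cross_entropy(logits, targets)
```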
arXiv Detail & Related papers (2022-09-14T13:42:50Z)
- Leveraging Acoustic Contextual Representation by Audio-textual
Cross-modal Learning for Conversational ASR [25.75615870266786]
We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
arXiv Detail & Related papers (2022-07-03T13:32:24Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
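As a hedged sketch of splitting message passing by context source, the block below aggregates temporal neighbours and same-frame neighbours through separate transforms before a joint node update; the layer choices and adjacency encoding are assumptions, not the paper's iGNN definition.

```python
import torch
import torch.nn as nn

class SplitMessagePassing(nn.Module):
    """Assumed context-split GNN update; not the paper's iGNN blocks."""

    def __init__(self, dim=128):
        super().__init__()
        self.temporal_msg = nn.Linear(dim, dim)  # same track across time
        self.spatial_msg = nn.Linear(dim, dim)   # other candidates, same frame
        self.update = nn.GRUCell(2 * dim, dim)

    def forward(self, nodes, temporal_adj, spatial_adj):
        # nodes: (N, dim); *_adj: (N, N) 0/1 adjacency, one per context source.
        t = temporal_adj @ self.temporal_msg(nodes)
        s = spatial_adj @ self.spatial_msg(nodes)
        return self.update(torch.cat([t, s], dim=-1), nodes)
```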
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
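One common way to cast overlapped diarization as single-label prediction is to map each frame's set of active speakers to one class in the power set; the sketch below shows that encoding as an assumption-level illustration, not necessarily SEND's exact scheme.

```python
from itertools import combinations

def powerset_labels(num_speakers):
    # Map every subset of simultaneously active speakers to one class index,
    # so overlapped frames become ordinary single-label targets.
    subsets = [()]
    for k in range(1, num_speakers + 1):
        subsets.extend(combinations(range(num_speakers), k))
    return {s: i for i, s in enumerate(subsets)}

labels = powerset_labels(3)        # 8 classes for 3 speakers
print(labels[(0, 2)])              # a frame where speakers 0 and 2 overlap
```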
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- DARER: Dual-task Temporal Relational Recurrent Reasoning Network for
Joint Dialog Sentiment Classification and Act Recognition [39.76268402567324]
The task of joint dialog sentiment classification (DSC) and dialog act recognition (DAR) aims to simultaneously predict the sentiment label and the act label for each utterance in a dialog.
We put forward a new framework that models the explicit dependencies between the two tasks by integrating prediction-level interactions.
To implement our framework, we propose a novel model dubbed DARER, which first generates context-, speaker- and temporal-sensitive utterance representations.
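As a hedged sketch of prediction-level interaction, the block below feeds each task's soft label distribution back into the other task's input for a second round of prediction; the shapes, label counts, and single refinement step are assumptions, not DARER itself.

```python
import torch
import torch.nn as nn

class PredictionLevelInteraction(nn.Module):
    """Assumed one-step DSC/DAR interaction; shapes and counts are illustrative."""

    def __init__(self, dim=256, n_sentiments=3, n_acts=5):
        super().__init__()
        self.sent_head = nn.Linear(dim, n_sentiments)
        self.act_head = nn.Linear(dim, n_acts)
        # Project the other task's label distribution back into feature space.
        self.from_acts = nn.Linear(n_acts, dim)
        self.from_sents = nn.Linear(n_sentiments, dim)

    def forward(self, utt_repr):
        # utt_repr: (num_utterances, dim) context-sensitive representations.
        sent_logits = self.sent_head(utt_repr)
        act_logits = self.act_head(utt_repr)
        # Refine each task's input with the other task's soft predictions.
        sent_in = utt_repr + self.from_acts(act_logits.softmax(dim=-1))
        act_in = utt_repr + self.from_sents(sent_logits.softmax(dim=-1))
        return self.sent_head(sent_in), self.act_head(act_in)
```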
arXiv Detail & Related papers (2022-03-08T05:19:18Z)
- Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network.
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
- Syntactic representation learning for neural network based TTS with
syntactic parse tree traversal [49.05471750563229]
We propose a syntactic representation learning method based on syntactic parse tree to automatically utilize the syntactic structure information.
Experimental results demonstrate the effectiveness of our proposed approach.
For sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived in the synthesized speech.
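To illustrate the general idea of deriving features from a parse tree traversal, here is a hypothetical sketch that linearises a tree depth-first into bracketed label tokens that a sequence encoder could consume; the traversal order and token scheme are assumptions, not the paper's method.

```python
class ParseNode:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def traverse(node):
    # Depth-first traversal emitting open/close markers around each subtree,
    # so the tree structure survives in the linear token sequence.
    tokens = ["(" + node.label]
    for child in node.children:
        tokens.extend(traverse(child))
    tokens.append(")")
    return tokens

# A toy parse of "The cat sleeps": S over NP(DT NN) and VP(VBZ).
tree = ParseNode("S", [ParseNode("NP", [ParseNode("DT"), ParseNode("NN")]),
                       ParseNode("VP", [ParseNode("VBZ")])])
print(" ".join(traverse(tree)))
# -> (S (NP (DT ) (NN ) ) (VP (VBZ ) ) )
```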
arXiv Detail & Related papers (2020-12-13T05:52:07Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)