CTL-MTNet: A Novel CapsNet and Transfer Learning-Based Mixed Task Net
for the Single-Corpus and Cross-Corpus Speech Emotion Recognition
- URL: http://arxiv.org/abs/2207.10644v1
- Date: Mon, 18 Jul 2022 09:09:23 GMT
- Title: CTL-MTNet: A Novel CapsNet and Transfer Learning-Based Mixed Task Net
for the Single-Corpus and Cross-Corpus Speech Emotion Recognition
- Authors: Xin-Cheng Wen, Jia-Xin Ye, Yan Luo, Yong Xu, Xuan-Ze Wang, Chang-Li Wu
and Kun-Hong Liu
- Abstract summary: Speech Emotion Recognition (SER) has become a growing focus of research in human-computer interaction.
To address this challenge, a Capsule Network (CapsNet) and Transfer Learning based Mixed Task Net (CTL-MTNet) is proposed to deal with both the single-corpus and cross-corpus SER tasks simultaneously.
The results indicate that, in both tasks, CTL-MTNet outperforms a number of state-of-the-art methods in all cases.
- Score: 15.098532236157556
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech Emotion Recognition (SER) has become a growing focus of research in
human-computer interaction. An essential challenge in SER is to extract common
attributes from different speakers or languages, especially when a model
trained on a specific source corpus has to recognize unknown data coming from
another speech corpus. To address this challenge, a Capsule Network (CapsNet)
and Transfer Learning based Mixed Task Net (CTL-MTNet) is proposed in this paper
to deal with both the single-corpus and cross-corpus SER tasks simultaneously.
For the single-corpus task, a combined Convolution-Pooling and Attention
CapsNet module (CPAC) is designed by embedding the self-attention mechanism
into the CapsNet, guiding the module to focus on the important features that
can be fed into different capsules. The high-level features extracted by
CPAC provide sufficient discriminative ability. Furthermore, to handle the
cross-corpus task, CTL-MTNet employs a Corpus Adaptation Adversarial Module
(CAAM) by combining CPAC with Margin Disparity Discrepancy (MDD), which can
learn domain-invariant emotion representations by extracting the strong
emotion commonness shared across corpora. Experiments, including ablation
studies and visualizations, on both single- and cross-corpus tasks using four
well-known SER datasets in different languages are conducted for performance evaluation and
comparison. The results indicate that, in both tasks, CTL-MTNet outperforms a
number of state-of-the-art methods in all cases. The
source code and the supplementary materials are available at:
https://github.com/MLDMXM2017/CTLMTNet
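
As a rough, hedged sketch of the CPAC idea summarized in the abstract (self-attention applied to convolution-pooling features before they are grouped into primary capsules), the following PyTorch snippet shows one plausible arrangement. All layer sizes, module names, and the log-Mel spectrogram input shape are illustrative assumptions and are not taken from the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def squash(s, dim=-1, eps=1e-8):
    """Standard CapsNet squashing non-linearity."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)


class SelfAttention2d(nn.Module):
    """Dot-product self-attention over the spatial positions of a feature map."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)            # (b, hw, c//8)
        k = self.key(x).flatten(2)                               # (b, c//8, hw)
        v = self.value(x).flatten(2)                             # (b, c, hw)
        attn = F.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)    # (b, hw, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x


class ConvAttentionCapsules(nn.Module):
    """Convolution-pooling -> self-attention -> primary capsules (illustrative)."""

    def __init__(self, in_channels=1, caps_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.attn = SelfAttention2d(128)
        self.caps_dim = caps_dim

    def forward(self, spec):
        feat = self.attn(self.conv(spec))                        # (b, 128, h/4, w/4)
        b = feat.shape[0]
        # Group attended channels at each spatial position into capsule vectors.
        caps = feat.permute(0, 2, 3, 1).reshape(b, -1, self.caps_dim)
        return squash(caps)                                      # (b, num_caps, caps_dim)


# Toy usage: a batch of 4 log-Mel "spectrogram images" (shape is an assumption).
caps = ConvAttentionCapsules()(torch.randn(4, 1, 64, 64))
print(caps.shape)  # torch.Size([4, 4096, 8]) with these toy sizes
```

The learned `gamma` residual weight lets the attention output be blended in gradually, a common choice when adding self-attention to an existing convolutional front end.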
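The cross-corpus module (CAAM) is described as CPAC combined with Margin Disparity Discrepancy (MDD). The sketch below implements a generic MDD-style adversarial objective with a gradient reversal layer, in the spirit of Zhang et al.'s MDD formulation; the auxiliary classifier, the 4-class emotion label space, and the margin value of 4.0 are assumptions for illustration and do not reproduce the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, coeff):
        ctx.coeff = coeff
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -ctx.coeff * grad, None


def mdd_loss(feats_src, feats_tgt, clf, adv_clf, margin=4.0, grl_coeff=1.0):
    """Margin disparity between the main classifier and an auxiliary adversarial head."""
    # Pseudo-labels come from the main classifier and are not back-propagated here.
    y_src = clf(feats_src).argmax(dim=1).detach()
    y_tgt = clf(feats_tgt).argmax(dim=1).detach()

    # The adversarial head sees gradient-reversed features.
    logits_adv_src = adv_clf(GradientReversal.apply(feats_src, grl_coeff))
    logits_adv_tgt = adv_clf(GradientReversal.apply(feats_tgt, grl_coeff))

    # Source term: the adversarial head should agree with the main head.
    loss_src = F.cross_entropy(logits_adv_src, y_src)
    # Target term: the adversarial head should disagree (modified logistic loss).
    p_tgt = F.softmax(logits_adv_tgt, dim=1).gather(1, y_tgt.unsqueeze(1)).squeeze(1)
    loss_tgt = -torch.log(torch.clamp(1.0 - p_tgt, min=1e-6)).mean()

    return margin * loss_src + loss_tgt


# Toy usage with random "capsule" features from a source and a target corpus.
clf = nn.Linear(128, 4)      # main emotion classifier (4 classes is an assumption)
adv_clf = nn.Linear(128, 4)  # auxiliary adversarial classifier
feats_s, feats_t = torch.randn(8, 128), torch.randn(8, 128)
labels_s = torch.randint(0, 4, (8,))
total = F.cross_entropy(clf(feats_s), labels_s) + mdd_loss(feats_s, feats_t, clf, adv_clf)
total.backward()
```

In this kind of setup, the auxiliary head is trained to agree with the main classifier on source utterances and to disagree on target utterances, while the gradient reversal pushes the shared features in the opposite direction; this minimax game is the mechanism behind the domain-invariant emotion representations mentioned in the abstract.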
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
- S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving [40.305452898732774]
S$^3$M-Net is a novel joint learning framework developed to perform semantic segmentation and stereo matching simultaneously.
S$^3$M-Net shares the features extracted from RGB images between both tasks, resulting in an improved overall scene understanding capability.
arXiv Detail & Related papers (2024-01-21T06:47:33Z)
- Masked Cross-image Encoding for Few-shot Segmentation [16.445813548503708]
Few-shot segmentation (FSS) is a dense prediction task that aims to infer the pixel-wise labels of unseen classes using only a limited number of annotated images.
We propose a joint learning method termed Masked Cross-Image Encoding (MCE), which is designed to capture common visual properties that describe object details and to learn bidirectional inter-image dependencies that enhance feature interaction.
arXiv Detail & Related papers (2023-08-22T05:36:39Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the cross-modal semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Semi-Supervised Cross-Modal Salient Object Detection with U-Structure Networks [18.12933868289846]
We integrate the linguistic information into the vision-based U-Structure networks designed for salient object detection tasks.
We propose a new module called efficient Cross-Modal Self-Attention (eCMSA) to combine visual and linguistic features.
To reduce the heavy burden of labeling, we employ a semi-supervised learning method by training an image caption model.
arXiv Detail & Related papers (2022-08-08T18:39:37Z)
- Learn-to-Decompose: Cascaded Decomposition Network for Cross-Domain Few-Shot Facial Expression Recognition [60.51225419301642]
We propose a novel cascaded decomposition network (CDNet) for compound facial expression recognition.
By training across similar tasks on basic expression datasets, CDNet learns the ability of learn-to-decompose that can be easily adapted to identify unseen compound expressions.
arXiv Detail & Related papers (2022-07-16T16:10:28Z)
- CI-Net: Contextual Information for Joint Semantic Segmentation and Depth Estimation [2.8785764686013837]
We propose a network injected with contextual information (CI-Net) to solve the problem.
With supervision from semantic labels, the network is embedded with contextual information so that it could understand the scene better.
We evaluate the proposed CI-Net on the NYU-Depth-v2 and SUN-RGBD datasets.
arXiv Detail & Related papers (2021-07-29T07:58:25Z)
- CTNet: Context-based Tandem Network for Semantic Segmentation [77.4337867789772]
This work proposes a novel Context-based Tandem Network (CTNet) by interactively exploring the spatial contextual information and the channel contextual information.
To further improve the performance of the learned representations for semantic segmentation, the results of the two context modules are adaptively integrated.
arXiv Detail & Related papers (2021-04-20T07:33:11Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.