A Transformer-based joint-encoding for Emotion Recognition and Sentiment
Analysis
- URL: http://arxiv.org/abs/2006.15955v1
- Date: Mon, 29 Jun 2020 11:51:46 GMT
- Title: A Transformer-based joint-encoding for Emotion Recognition and Sentiment
Analysis
- Authors: Jean-Benoit Delbrouck, Noé Tits, Mathilde Brousmiche, and Stéphane Dupont
- Abstract summary: This paper describes a Transformer-based joint-encoding (TBJE) for the task of Emotion Recognition and Sentiment Analysis.
In addition to using the Transformer architecture, our approach relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities.
- Score: 8.927538538637783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expressed sentiment and emotions are two crucial factors in
human multimodal language. This paper describes a Transformer-based
joint-encoding (TBJE) for the task of Emotion Recognition and Sentiment
Analysis. In addition to using the Transformer architecture, our approach relies
on a modular co-attention and a glimpse layer to jointly encode one or more
modalities. The proposed solution has also been submitted to the ACL20: Second
Grand-Challenge on Multimodal Language to be evaluated on the CMU-MOSEI
dataset. The code to replicate the presented experiments is open-source:
https://github.com/jbdel/MOSEI_UMONS.
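To make the joint-encoding idea concrete, below is a minimal PyTorch sketch of how a modular co-attention block and a glimpse (attentive-pooling) layer can jointly encode two modalities such as text and audio. Layer sizes, the number of glimpses, and the classifier head are illustrative assumptions, not the authors' exact configuration; the linked repository contains the actual implementation.

    import torch
    import torch.nn as nn

    class CoAttentionBlock(nn.Module):
        """Self-attention on one modality, then cross-attention over the other."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
            self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

        def forward(self, x, y):
            x = self.norms[0](x + self.self_attn(x, x, x)[0])   # encode modality x
            x = self.norms[1](x + self.cross_attn(x, y, y)[0])  # let x attend to modality y
            return self.norms[2](x + self.ffn(x))

    class GlimpseLayer(nn.Module):
        """Attentive pooling: compress a sequence into a few 'glimpse' vectors."""
        def __init__(self, d_model=512, n_glimpses=2):
            super().__init__()
            self.scorer = nn.Linear(d_model, n_glimpses)

        def forward(self, x):                                   # x: (batch, seq, d_model)
            weights = torch.softmax(self.scorer(x), dim=1)      # one distribution per glimpse
            return torch.einsum("bsg,bsd->bgd", weights, x)     # (batch, n_glimpses, d_model)

    class JointEncoder(nn.Module):
        def __init__(self, d_model=512, n_layers=2, n_glimpses=2, n_classes=7):
            super().__init__()
            self.text_blocks = nn.ModuleList(CoAttentionBlock(d_model) for _ in range(n_layers))
            self.audio_blocks = nn.ModuleList(CoAttentionBlock(d_model) for _ in range(n_layers))
            self.glimpse_text = GlimpseLayer(d_model, n_glimpses)
            self.glimpse_audio = GlimpseLayer(d_model, n_glimpses)
            self.classifier = nn.Linear(2 * n_glimpses * d_model, n_classes)

        def forward(self, text, audio):                         # both: (batch, seq, d_model)
            for t_block, a_block in zip(self.text_blocks, self.audio_blocks):
                text, audio = t_block(text, audio), a_block(audio, text)
            pooled = torch.cat([self.glimpse_text(text).flatten(1),
                                self.glimpse_audio(audio).flatten(1)], dim=-1)
            return self.classifier(pooled)

    logits = JointEncoder()(torch.randn(4, 20, 512), torch.randn(4, 50, 512))
    print(logits.shape)                                         # torch.Size([4, 7])

In this reading, each modality queries the other through co-attention, and the glimpse layers compress both variable-length sequences into fixed-size vectors before classification.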
Related papers
- Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model [5.301672905886949]
This report introduces a solution that uses multimodal large language models (MLLMs) to generate open-vocabulary emotion labels from a video.
In the MER-OV (Open-Vocabulary Emotion Recognition) track of the MER2024 challenge, the method achieved significant advantages, demonstrating superior capability in complex emotion computation.
arXiv Detail & Related papers (2024-08-21T02:17:18Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers [1.0152838128195467]
Three input modalities, namely text, audio (speech), and video, are employed to generate multimodal feature vectors.
Features for each modality are generated with fine-tuned, pre-trained Transformer models.
The best model, which combines feature-level fusion by concatenating feature vectors and classification using a Support Vector Machine, achieves an accuracy of 75.42%.
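As an illustration of the feature-level fusion described in this entry, the sketch below concatenates per-modality feature vectors and classifies them with an SVM. The feature dimensions, labels, and random data are stand-ins, not the paper's actual extracted features or reported accuracy.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_samples = 200
    # Stand-ins for features extracted by fine-tuned pre-trained Transformers.
    text_feats = rng.normal(size=(n_samples, 768))    # e.g. a text-Transformer embedding
    audio_feats = rng.normal(size=(n_samples, 512))   # e.g. a speech-Transformer embedding
    video_feats = rng.normal(size=(n_samples, 256))   # e.g. a video-Transformer embedding
    labels = rng.integers(0, 4, size=n_samples)       # placeholder emotion labels

    # Feature-level fusion: simple concatenation along the feature axis.
    fused = np.concatenate([text_feats, audio_feats, video_feats], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(fused, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))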
arXiv Detail & Related papers (2024-02-11T23:27:24Z)
- Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cue extraction is performed on modalities with strong representation ability, while feature filters are designed as multimodal prompt information for modalities with weak representation ability.
The proposed Multimodal Prompt Transformer (MPT) embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
- Exchanging-based Multimodal Fusion with Transformer [19.398692598523454]
We study the problem of multimodal fusion in this paper.
Recent exchanging-based methods have been proposed for vision-vision fusion, which aim to exchange embeddings learned from one modality to the other.
We propose MuSE, a novel exchanging-based multimodal fusion model for text-vision fusion based on the Transformer.
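A toy reading of the exchanging idea follows; the criterion used here (replacing low-norm tokens of one modality with projected tokens from the other) is our own illustrative assumption, not necessarily MuSE's exact mechanism.

    import torch
    import torch.nn as nn

    d_model = 256
    proj_v2t = nn.Linear(d_model, d_model)   # assumed projection from vision to text space
    proj_t2v = nn.Linear(d_model, d_model)   # assumed projection from text to vision space

    def exchange(a, b, proj):
        """Replace the weaker half of a's tokens (by L2 norm) with projected tokens from b."""
        norms = a.norm(dim=-1, keepdim=True)                     # (batch, seq, 1)
        weak = norms < norms.median(dim=1, keepdim=True).values  # below the per-sample median
        return torch.where(weak, proj(b), a)

    # For simplicity both modalities use the same sequence length and width here.
    text = torch.randn(2, 16, d_model)
    vision = torch.randn(2, 16, d_model)
    text_fused = exchange(text, vision, proj_v2t)
    vision_fused = exchange(vision, text, proj_t2v)
    print(text_fused.shape, vision_fused.shape)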
arXiv Detail & Related papers (2023-09-05T12:48:25Z)
- Meta-Transformer: A Unified Framework for Multimodal Learning [105.77219833997962]
Multimodal learning aims to build models that process and relate information from multiple modalities.
Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities.
We propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception.
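A schematic of the frozen-encoder idea: per-modality projections and a small task head here are our own illustrative stand-ins, not Meta-Transformer's actual tokenizers or heads.

    import torch
    import torch.nn as nn

    d_model = 256
    shared_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
    for p in shared_encoder.parameters():
        p.requires_grad = False                  # the shared encoder stays frozen

    text_proj = nn.Linear(300, d_model)          # assumed per-modality projections
    image_proj = nn.Linear(196, d_model)
    head = nn.Linear(d_model, 3)                 # small task head; in training, only the
                                                 # projections and head would receive gradients
    def encode(tokens, proj):
        z = shared_encoder(proj(tokens))         # frozen shared computation
        return z.mean(dim=1)                     # simple pooled representation

    text = torch.randn(2, 24, 300)               # (batch, seq, raw feature dim)
    image = torch.randn(2, 49, 196)
    logits = head(encode(text, text_proj) + encode(image, image_proj))
    print(logits.shape)                          # torch.Size([2, 3])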
arXiv Detail & Related papers (2023-07-20T12:10:29Z)
- Brain encoding models based on multimodal transformers can transfer across language and vision [60.72020004771044]
We used representations from multimodal transformers to train encoding models that can transfer across fMRI responses to stories and movies.
We found that encoding models trained on brain responses to one modality can successfully predict brain responses to the other modality.
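A hedged sketch of such a cross-modal transfer experiment, using synthetic stand-in data: ridge regression is a common choice for fMRI encoding models, but the paper's exact estimator and features may differ.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n_time, n_feat, n_voxels = 300, 128, 50
    W = rng.normal(size=(n_feat, n_voxels))                    # shared "true" mapping

    story_feats = rng.normal(size=(n_time, n_feat))            # transformer features, stories
    movie_feats = rng.normal(size=(n_time, n_feat))            # transformer features, movies
    story_bold = story_feats @ W + 0.5 * rng.normal(size=(n_time, n_voxels))
    movie_bold = movie_feats @ W + 0.5 * rng.normal(size=(n_time, n_voxels))

    model = Ridge(alpha=10.0).fit(story_feats, story_bold)     # train on the story modality
    pred = model.predict(movie_feats)                          # transfer to the movie modality
    r = [np.corrcoef(pred[:, v], movie_bold[:, v])[0, 1] for v in range(n_voxels)]
    print("mean voxel-wise correlation:", float(np.mean(r)))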
arXiv Detail & Related papers (2023-05-20T17:38:44Z)
- Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal Encoder-Decoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules, including object and sentence encoders that separately learn the representations of each modality.
arXiv Detail & Related papers (2022-01-11T16:15:07Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
- Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition [7.799182201815763]
This paper aims to bring a new lightweight yet powerful solution for the task of Emotion Recognition and Sentiment Analysis.
Our motivation is to propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state-of-the-art in the field.
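One common form of modulation is a FiLM-style scale-and-shift; the sketch below is our assumption of what modulating linguistic features by an acoustic summary can look like, not necessarily the paper's exact operator.

    import torch
    import torch.nn as nn

    d_model = 256
    to_scale_shift = nn.Linear(d_model, 2 * d_model)   # acoustic summary -> (gamma, beta)

    def modulate(text_tokens, acoustic_summary):
        gamma, beta = to_scale_shift(acoustic_summary).chunk(2, dim=-1)
        return text_tokens * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

    text = torch.randn(2, 30, d_model)                  # (batch, seq, d_model) linguistic tokens
    acoustic = torch.randn(2, d_model)                  # pooled acoustic representation
    print(modulate(text, acoustic).shape)               # torch.Size([2, 30, 256])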
arXiv Detail & Related papers (2020-10-05T14:46:20Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.