Video Sentiment Analysis with Bimodal Information-augmented Multi-Head
Attention
- URL: http://arxiv.org/abs/2103.02362v1
- Date: Wed, 3 Mar 2021 12:30:11 GMT
- Title: Video Sentiment Analysis with Bimodal Information-augmented Multi-Head
Attention
- Authors: Ting Wu, Junjie Peng, Wenqiang Zhang, Huiran Zhang, Chuanshuai Ma and
Yansong Huang
- Abstract summary: This study focuses on the sentiment analysis of videos containing time series data of multiple modalities.
The key problem is how to fuse these heterogeneous data.
Based on bimodal interaction, more important bimodal features are assigned larger weights.
- Score: 7.997124140597719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiment analysis is the basis of intelligent human-computer interaction. As
one of the frontier research directions of artificial intelligence, it can help
computers better identify human intentions and emotional states and thus provide
more personalized services. However, because humans express sentiments through spoken
words, gestures, facial expressions and other signals that involve varied forms of
data including text, audio and video, this task poses many challenges. Due to
the limitations of unimodal sentiment analysis, recent research
has focused on the sentiment analysis of videos containing time-series data of
multiple modalities. When analyzing videos with multimodal data, the key
problem is how to fuse these heterogeneous data. Since each modality contributes
differently, current fusion methods tend to extract the important information of each
single modality prior to fusion, which ignores the consistency and complementarity
of bimodal interaction and affects the final decision. To solve this problem, a video
sentiment analysis method using bimodal information-augmented multi-head attention
is proposed. Based on bimodal interaction, more important bimodal features are
assigned larger weights. In this way, different feature representations are
adaptively assigned corresponding attention for effective multimodal fusion.
Extensive experiments were conducted on both Chinese and English public
datasets. The results show that our approach outperforms existing methods
and gives insight into the contributions of bimodal interaction among the
three modalities.
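
As a rough illustration of the fusion idea described in the abstract (not the authors' exact architecture), the PyTorch sketch below forms text-audio, text-video and audio-video pairs from projected unimodal features and lets multi-head self-attention assign larger weights to the more informative bimodal representations before pooling into a sentiment score. All layer sizes, the pair-projection scheme and the pooling step are assumptions for illustration.

```python
import torch
import torch.nn as nn


class BimodalAttentionFusion(nn.Module):
    """Toy sketch of bimodal-pair fusion with multi-head attention.

    Dimensions, layer choices and the pooling step are assumptions,
    not the paper's exact design.
    """

    def __init__(self, d_text=768, d_audio=74, d_video=35, d_model=128, n_heads=4):
        super().__init__()
        # Project each unimodal feature vector to a shared dimension.
        self.proj_t = nn.Linear(d_text, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        # One projection per bimodal pair: text-audio, text-video, audio-video.
        self.pair_ta = nn.Linear(2 * d_model, d_model)
        self.pair_tv = nn.Linear(2 * d_model, d_model)
        self.pair_av = nn.Linear(2 * d_model, d_model)
        # Multi-head self-attention over the three bimodal representations.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # e.g. sentiment intensity score

    def forward(self, text, audio, video):
        # Inputs: utterance-level features of shape (batch, d_modality).
        t, a, v = self.proj_t(text), self.proj_a(audio), self.proj_v(video)
        pairs = torch.stack([
            self.pair_ta(torch.cat([t, a], dim=-1)),
            self.pair_tv(torch.cat([t, v], dim=-1)),
            self.pair_av(torch.cat([a, v], dim=-1)),
        ], dim=1)  # (batch, 3, d_model)
        # Attention adaptively weights the bimodal features, so more
        # informative pairs contribute more to the fused representation.
        fused, weights = self.attn(pairs, pairs, pairs)
        return self.head(fused.mean(dim=1)), weights


# Example with random utterance-level features (batch of 8).
model = BimodalAttentionFusion()
score, attn_weights = model(
    torch.randn(8, 768), torch.randn(8, 74), torch.randn(8, 35)
)
```

The returned attention weights can be inspected to see which bimodal pair dominates a given prediction, mirroring the paper's claim that the method gives insight into the contributions of bimodal interaction.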
Related papers
- Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a multimodal transcript.
arXiv Detail & Related papers (2024-09-13T18:28:12Z) - End-to-end Semantic-centric Video-based Multimodal Affective Computing [27.13963885724786]
We propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affection for human-spoken videos.
We employ a pre-trained Transformer model for multimodal data pre-processing and design an Affective Perceiver module to capture unimodal affective information.
SemanticMAC effectively learns specific and shared semantic representations under the guidance of semantic-centric labels.
arXiv Detail & Related papers (2024-08-14T17:50:27Z) - AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in
Group Conversations [39.79734528362605]
The Multimodal Attention Network captures cross-modal interactions at various levels of spatial abstraction.
The AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A
Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, meetings composed of 3-4 persons, and microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z) - High-Modality Multimodal Transformer: Quantifying Modality & Interaction
Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z) - Multimodal Representations Learning Based on Mutual Information
Maximization and Minimization and Identity Embedding for Multimodal Sentiment
Analysis [33.73730195500633]
We propose a multimodal representation model based on Mutual information Maximization and Identity Embedding.
Experimental results on two public datasets demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-01-10T01:41:39Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - MISA: Modality-Invariant and -Specific Representations for Multimodal
Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
arXiv Detail & Related papers (2020-05-07T15:13:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.