Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
- URL: http://arxiv.org/abs/2402.03658v1
- Date: Tue, 6 Feb 2024 03:14:46 GMT
- Title: Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
- Authors: Kun Ouyang and Liqiang Jing and Xuemeng Song and Meng Liu and Yupeng
Hu and Liqiang Nie
- Abstract summary: We propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE.
In particular, we first propose a lexicon-guided utterance sentiment inference module, where a heuristic utterance sentiment refinement strategy is devised.
We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip.
- Score: 67.09698638709065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which
aims to generate a natural language explanation for the given sarcastic
dialogue that involves multiple modalities (i.e., utterance, video, and audio).
Although existing studies have achieved great success based on the generative
pretrained language model BART, they overlook exploiting the sentiments
residing in the utterance, video and audio, which are vital clues for sarcasm
explanation. In fact, it is non-trivial to incorporate sentiments for boosting
SED performance, due to three main challenges: 1) diverse effects of utterance
tokens on sentiments; 2) gap between video-audio sentiment signals and the
embedding space of BART; and 3) various relations among utterances, utterance
sentiments, and video-audio sentiments. To tackle these challenges, we propose
a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation
framework, named EDGE. In particular, we first propose a lexicon-guided
utterance sentiment inference module, where a heuristic utterance sentiment
refinement strategy is devised. We then develop a module named Joint Cross
Attention-based Sentiment Inference (JCA-SI) by extending the multimodal
sentiment analysis model JCA to derive the joint sentiment label for each
video-audio clip. Thereafter, we devise a context-sentiment graph to
comprehensively model the semantic relations among the utterances, utterance
sentiments, and video-audio sentiments, to facilitate sarcasm explanation
generation. Extensive experiments on the publicly released dataset WITS verify
the superiority of our model over cutting-edge methods.
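The abstract describes the context-sentiment graph only at a high level. The minimal sketch below illustrates the relational skeleton the paper names: utterance nodes, utterance-sentiment nodes, and video-audio-sentiment nodes connected by context and sentiment edges. The dialogue content, node names, and edge rules are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the EDGE code): a context-sentiment graph over a
# short sarcastic dialogue. Node and edge choices are assumptions.
import networkx as nx

# Hypothetical dialogue: (utterance, lexicon-inferred utterance sentiment,
# JCA-style video-audio sentiment for the same clip).
dialogue = [
    ("Oh great, another meeting.", "negative", "negative"),
    ("I am sure it will be thrilling.", "positive", "negative"),
]

G = nx.Graph()
for i, (utt, utt_sent, va_sent) in enumerate(dialogue):
    u_node, us_node, vs_node = f"utt_{i}", f"utt_sent_{i}", f"va_sent_{i}"
    G.add_node(u_node, text=utt, kind="utterance")
    G.add_node(us_node, label=utt_sent, kind="utterance_sentiment")
    G.add_node(vs_node, label=va_sent, kind="video_audio_sentiment")
    G.add_edge(u_node, us_node)   # utterance <-> its lexicon-inferred sentiment
    G.add_edge(u_node, vs_node)   # utterance <-> its video-audio sentiment
    if i > 0:
        G.add_edge(f"utt_{i-1}", u_node)  # conversational context chain

print(G.number_of_nodes(), G.number_of_edges())  # 6 5
```

In EDGE itself these nodes would presumably carry BART-compatible embeddings and the graph would be encoded before explanation decoding; the sketch only conveys the kinds of relations the paper says it models.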
Related papers
- PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing a multimodal conversational Aspect-based Sentiment Analysis (ABSA) task.
To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements.
To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
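The summary does not spell out the six elements of the sextuple; purely as an illustration, and assuming a (holder, target, aspect, opinion, sentiment, rationale) decomposition, one extracted instance could be represented as follows.

```python
# Illustrative only: the six fields below are an assumption about what a
# conversational ABSA "sentiment sextuple" might contain.
from dataclasses import dataclass

@dataclass
class SentimentSextuple:
    holder: str      # who expresses the opinion
    target: str      # entity being discussed
    aspect: str      # aspect of the target
    opinion: str     # opinion expression (possibly implicit)
    sentiment: str   # polarity label
    rationale: str   # supporting evidence in the dialogue

example = SentimentSextuple(
    holder="Speaker A",
    target="the new phone",
    aspect="battery",
    opinion="barely lasts a morning",
    sentiment="negative",
    rationale="complaint about the battery in turn 3",
)
print(example.sentiment)
```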
arXiv Detail & Related papers (2024-08-18T13:51:01Z) - VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features [13.922091192207718]
Sarcasm recognition aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue.
We propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data.
We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net to unseen samples from another dataset, MUStARD++.
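The snippet names a "lightweight depth attention module" without detail; the sketch below is one plausible reading (a depthwise-convolution attention gate over frame features) and is an assumption, not VyAnG-Net's actual module.

```python
# Minimal sketch, assuming a depthwise-conv attention gate over visual
# feature maps; illustrative, not the published architecture.
import torch
import torch.nn as nn

class LightweightDepthAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Depthwise conv keeps the module lightweight (one filter per channel).
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) visual features
        attn = self.gate(self.depthwise(x))
        return x * attn  # emphasise the most informative locations

frames = torch.randn(2, 64, 14, 14)          # hypothetical frame features
weighted = LightweightDepthAttention(64)(frames)
print(weighted.shape)                        # torch.Size([2, 64, 14, 14])
```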
arXiv Detail & Related papers (2024-08-05T15:36:52Z) - VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
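As a rough illustration of "language-aware" vision perception, the sketch below gates and pools visual embeddings with a question vector so that only question-related visual evidence reaches the answer decoder; the module and dimensions are assumptions, not VideoDistill's implementation.

```python
# Sketch of question-conditioned gating over visual embeddings.
import torch
import torch.nn as nn

class QuestionGatedPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim * 2, 1)

    def forward(self, visual: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # visual: (batch, frames, dim); question: (batch, dim)
        q = question.unsqueeze(1).expand(-1, visual.size(1), -1)
        weights = torch.softmax(self.score(torch.cat([visual, q], dim=-1)), dim=1)
        # Answers are decoded only from this question-weighted visual summary.
        return (weights * visual).sum(dim=1)

pooled = QuestionGatedPooling(256)(torch.randn(2, 8, 256), torch.randn(2, 256))
print(pooled.shape)  # torch.Size([2, 256])
```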
arXiv Detail & Related papers (2024-04-01T07:44:24Z) - SAIDS: A Novel Approach for Sentiment Analysis Informed of Dialect and
Sarcasm [0.0]
This paper introduces a novel system (SAIDS) that predicts the sentiment, sarcasm and dialect of Arabic tweets.
By training all tasks together, SAIDS achieves 75.98 FPN, 59.09 F1-score, and 71.13 F1-score for sentiment analysis, sarcasm detection, and dialect identification, respectively.
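Joint training of this kind typically shares one encoder representation across task-specific heads so that the three losses can be summed. The sketch below is a generic illustration with an assumed hidden size and label counts, not the actual SAIDS architecture.

```python
# Illustrative multi-task heads over a shared tweet representation.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, hidden: int = 256,
                 n_sentiments: int = 3, n_sarcasm: int = 2, n_dialects: int = 5):
        super().__init__()
        self.sentiment = nn.Linear(hidden, n_sentiments)
        self.sarcasm = nn.Linear(hidden, n_sarcasm)
        self.dialect = nn.Linear(hidden, n_dialects)

    def forward(self, tweet_repr: torch.Tensor):
        # One shared representation feeds all three tasks.
        return (self.sentiment(tweet_repr),
                self.sarcasm(tweet_repr),
                self.dialect(tweet_repr))

s, c, d = MultiTaskHead()(torch.randn(4, 256))
print(s.shape, c.shape, d.shape)
```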
arXiv Detail & Related papers (2023-01-06T14:19:46Z) - Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z) - Explaining (Sarcastic) Utterances to Enhance Affect Understanding in
Multimodal Dialogues [40.80696210030204]
We propose MOSES, a deep neural network, which takes a multimodal (sarcastic) dialogue instance as an input and generates a natural language sentence as its explanation.
We leverage the generated explanation for various natural language understanding tasks in a conversational dialogue setup, such as sarcasm detection, humour identification, and emotion recognition.
Our evaluation shows that MOSES outperforms the state-of-the-art system for SED by an average of 2% on different evaluation metrics.
arXiv Detail & Related papers (2022-11-20T18:05:43Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
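The view-temporal attention itself is not detailed in this snippet; the toy sketch below attends first over camera views and then over time steps, which is only one plausible instantiation of the idea, not the paper's mechanism.

```python
# Toy sketch: attention over views, then over time (visemic importance).
import torch
import torch.nn as nn

class ViewTemporalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.view_score = nn.Linear(dim, 1)
        self.time_score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, time, dim) visual speech features
        view_w = torch.softmax(self.view_score(x), dim=1)        # which view to trust
        fused = (view_w * x).sum(dim=1)                          # (batch, time, dim)
        time_w = torch.softmax(self.time_score(fused), dim=1)    # which frames matter
        return (time_w * fused).sum(dim=1)                       # (batch, dim)

out = ViewTemporalAttention(128)(torch.randn(2, 3, 20, 128))
print(out.shape)  # torch.Size([2, 128])
```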
arXiv Detail & Related papers (2020-06-12T06:51:55Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
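As a hedged illustration of teacher-recommended learning, the sketch below mixes standard cross-entropy with a KL term that pulls the caption model toward soft word distributions from an external language model; the loss weighting and target construction are assumptions, not the paper's exact recipe.

```python
# Rough sketch: caption model supervised by ground truth plus ELM soft targets.
import torch
import torch.nn.functional as F

def trl_loss(student_logits, teacher_probs, gold_ids, alpha=0.5):
    # student_logits: (batch, vocab) caption-model scores for the next word
    # teacher_probs:  (batch, vocab) soft targets recommended by the ELM
    # gold_ids:       (batch,) ground-truth next-word indices
    ce = F.cross_entropy(student_logits, gold_ids)            # standard supervision
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  teacher_probs, reduction="batchmean")        # follow the teacher
    return ce + alpha * kl

loss = trl_loss(torch.randn(4, 1000),
                torch.softmax(torch.randn(4, 1000), dim=-1),
                torch.randint(0, 1000, (4,)))
print(loss.item())
```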
arXiv Detail & Related papers (2020-02-26T15:34:52Z) - A Deep Neural Framework for Contextual Affect Detection [51.378225388679425]
A short and simple text carrying no emotion can convey strong emotions when read along with its context.
We propose a Contextual Affect Detection framework which learns the inter-dependence of words in a sentence.
arXiv Detail & Related papers (2020-01-28T05:03:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.