FiLMing Multimodal Sarcasm Detection with Attention
- URL: http://arxiv.org/abs/2110.00416v1
- Date: Mon, 9 Aug 2021 06:33:29 GMT
- Title: FiLMing Multimodal Sarcasm Detection with Attention
- Authors: Sundesh Gupta, Aditya Shah, Miten Shah, Laribok Syiemlieh, Chandresh
Maurya
- Abstract summary: Sarcasm detection identifies natural language expressions whose intended meaning is different from what is implied by their surface meaning.
We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes.
Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal sarcasm detection dataset.
- Score: 0.7340017786387767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sarcasm detection identifies natural language expressions whose intended
meaning is different from what is implied by their surface meaning. It finds
applications in many NLP tasks such as opinion mining, sentiment analysis, etc.
Today, social media has given rise to an abundant amount of multimodal data
where users express their opinions through text and images. Our paper aims to
leverage multimodal data to improve the performance of the existing systems for
sarcasm detection. So far, various approaches have been proposed that use text
and image modalities and a fusion of both. We propose a novel architecture that
uses the RoBERTa model with a co-attention layer on top to incorporate context
incongruity between input text and image attributes. Further, we integrate
feature-wise affine transformation by conditioning the input image through
FiLMed ResNet blocks with the textual features using the GRU network to capture
the multimodal information. The outputs of both models and the CLS token
from RoBERTa are concatenated and used for the final prediction. Our results
demonstrate that our proposed model outperforms the existing state-of-the-art
method by 6.14% F1 score on the public Twitter multimodal sarcasm detection
dataset.
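
The feature-wise affine transformation mentioned in the abstract (FiLM) scales and shifts each image feature channel using parameters predicted from the text. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the helper names (`linear`, `film`), the toy weights, and the 3-dimensional `text_state` standing in for a GRU output are all assumptions made for illustration.

```python
# Illustrative FiLM (feature-wise linear modulation) sketch in plain Python.
# All names, weights, and sizes are hypothetical; this is not the paper's code.

def linear(vec, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(feature_map, gamma, beta):
    """Per-channel affine transform: out[c][i][j] = gamma[c] * x[c][i][j] + beta[c]."""
    return [[[g * x + b for x in row] for row in channel]
            for channel, g, b in zip(feature_map, gamma, beta)]

# Stand-in for the final hidden state of a text encoder such as a GRU
# (assumed 3-dimensional here purely for readability).
text_state = [0.5, -1.0, 2.0]

# Hypothetical projections that map the text state to one (gamma, beta)
# pair per image channel.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
b_beta = [0.0, 0.5]

gamma = linear(text_state, w_gamma, b_gamma)
beta = linear(text_state, w_beta, b_beta)

# Toy image features: 2 channels, each a 2x2 spatial map.
image_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

# Text-conditioned image features.
modulated = film(image_features, gamma, beta)
print(modulated)
```

In this toy setup the second channel receives a scale of zero, so the text effectively suppresses it and only the shift remains. In a FiLMed ResNet block the same transform would typically be applied to a convolutional feature map inside the residual branch, though the exact placement used in the paper is not stated in the abstract.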
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- RCLMuFN: Relational Context Learning and Multiplex Fusion Network for Multimodal Sarcasm Detection [1.023096557577223]
We propose a relational context learning and multiplex fusion network (RCLMuFN) for multimodal sarcasm detection.
Firstly, we employ four feature extractors to comprehensively extract features from raw text and images.
Secondly, we utilize the relational context learning module to learn the contextual information of text and images.
arXiv Detail & Related papers (2024-12-17T15:29:31Z)
- Multimodal Sentiment Analysis Based on BERT and ResNet [0.0]
A multimodal sentiment analysis framework combining BERT and ResNet is proposed.
BERT has shown strong text representation ability in natural language processing, and ResNet has excellent image feature extraction performance in the field of computer vision.
Experimental results on the public dataset MAVA-single show that compared with the single-modal models that only use BERT or ResNet, the proposed multi-modal model improves the accuracy and F1 score, reaching the best accuracy of 74.5%.
arXiv Detail & Related papers (2024-12-04T15:55:20Z)
- Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection [12.744170917349287]
This study presents a novel framework for multimodal sarcasm detection that can process input triplets.
The proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets.
arXiv Detail & Related papers (2024-08-05T16:07:31Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation [53.97962603641629]
We propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM.
TEAM extracts the object-level semantic meta-data instead of the traditional global visual features from the input image.
TEAM introduces a multi-source semantic graph that comprehensively characterizes the multi-source semantic relations.
arXiv Detail & Related papers (2023-06-29T03:26:10Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
- RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER [4.510210055307459]
Multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets.
We introduce a method of text-image relation propagation into the multimodal BERT model.
We propose a multitask algorithm to train on the MNER datasets.
arXiv Detail & Related papers (2021-02-05T02:45:30Z)
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.