Related papers: FiLMing Multimodal Sarcasm Detection with Attention

FiLMing Multimodal Sarcasm Detection with Attention

URL: http://arxiv.org/abs/2110.00416v1
Date: Mon, 9 Aug 2021 06:33:29 GMT
Title: FiLMing Multimodal Sarcasm Detection with Attention
Authors: Sundesh Gupta, Aditya Shah, Miten Shah, Laribok Syiemlieh, Chandresh Maurya
Abstract summary: Sarcasm detection identifies natural language expressions whose intended meaning is different from what is implied by its surface meaning. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes. Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
Score: 0.7340017786387767
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sarcasm detection identifies natural language expressions whose intended meaning is different from what is implied by its surface meaning. It finds applications in many NLP tasks such as opinion mining, sentiment analysis, etc. Today, social media has given rise to an abundant amount of multimodal data where users express their opinions through text and images. Our paper aims to leverage multimodal data to improve the performance of the existing systems for sarcasm detection. So far, various approaches have been proposed that uses text and image modality and a fusion of both. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes. Further, we integrate feature-wise affine transformation by conditioning the input image through FiLMed ResNet blocks with the textual features using the GRU network to capture the multimodal information. The output from both the models and the CLS token from RoBERTa is concatenated and used for the final prediction. Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal sarcasm detection dataset.

Related papers

FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens [56.752362642658504]
We present FuseLIP, an alternative architecture for multimodal embedding.<n>We propose a single transformer model which operates on an extended vocabulary of text and image tokens.<n>We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval.
arXiv Detail & Related papers (2025-06-03T17:27:12Z)
Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning [22.494473679788396]
Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks.<n>To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND)<n>MFND contains 11 manipulated types, designed to detect and localize highly authentic fake news.
arXiv Detail & Related papers (2025-05-11T00:26:13Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
RCLMuFN: Relational Context Learning and Multiplex Fusion Network for Multimodal Sarcasm Detection [1.023096557577223]
We propose a relational context learning and multiplex fusion network (RCLMuFN) for multimodal sarcasm detection. Firstly, we employ four feature extractors to comprehensively extract features from raw text and images. Secondly, we utilize the relational context learning module to learn the contextual information of text and images.
arXiv Detail & Related papers (2024-12-17T15:29:31Z)
Multimodal Sentiment Analysis Based on BERT and ResNet [0.0]
multimodal sentiment analysis framework combining BERT and ResNet was proposed. BERT has shown strong text representation ability in natural language processing, and ResNet has excellent image feature extraction performance in the field of computer vision. Experimental results on the public dataset MAVA-single show that compared with the single-modal models that only use BERT or ResNet, the proposed multi-modal model improves the accuracy and F1 score, reaching the best accuracy of 74.5%.
arXiv Detail & Related papers (2024-12-04T15:55:20Z)
Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection [12.744170917349287]
This study presents a novel framework for multimodal sarcasm detection that can process input triplets. The proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets.
arXiv Detail & Related papers (2024-08-05T16:07:31Z)
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues [20.587249765287183]
Feature Swapping Multi-modal Reasoning (FSMR) model is designed to enhance multi-modal reasoning through feature swapping. FSMR incorporates a multi-modal cross-attention mechanism, facilitating the joint modeling of textual and visual information. Experiments on the PMR dataset demonstrate FSMR's superiority over state-of-the-art baseline models.
arXiv Detail & Related papers (2024-03-29T07:28:50Z)
Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text. Visual-text aggregation module based on Transformer is further designed to incorporate cross-modal-temporal complementary information. experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation [53.97962603641629]
We propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM. TEAM extracts the object-level semantic meta-data instead of the traditional global visual features from the input image. TEAM introduces a multi-source semantic graph that comprehensively characterize the multi-source semantic relations.
arXiv Detail & Related papers (2023-06-29T03:26:10Z)
Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
Multimodal Fake News Detection with Adaptive Unimodal Representation Aggregation [28.564442206829625]
AURA is a multimodal fake news detection network with adaptive unimodal representation aggregation. We perform coarse-level fake news detection and cross-modal cosistency learning according to the unimodal and multimodal representations. Experiments on Weibo and Gossipcop prove that AURA can successfully beat several state-of-the-art FND schemes.
arXiv Detail & Related papers (2022-06-12T14:06:55Z)
Exploiting BERT For Multimodal Target SentimentClassification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER [4.510210055307459]
multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. We introduce a method of text-image relation propagation into the multimodal BERT model. We propose a multitask algorithm to train on the MNER datasets.
arXiv Detail & Related papers (2021-02-05T02:45:30Z)
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT. We first represent the input sentence and image using a unified multi-modal graph. We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.