InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
- URL: http://arxiv.org/abs/2406.16464v5
- Date: Mon, 16 Dec 2024 04:13:38 GMT
- Title: InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
- Authors: Junjie Chen, Hang Yu, Subin Huang, Sanmin Liu, Linfeng Zhang
- Abstract summary: Sarcasm in social media, often expressed through text-image combinations, poses challenges for sentiment analysis and intention mining.
We propose InterCLIP-MEP, which introduces Interactive CLIP with an efficient training strategy to extract enriched text-image representations.
We show that InterCLIP-MEP achieves state-of-the-art performance, with significant accuracy and F1 score improvements on MMSD and MMSD2.0.
- Score: 17.55808303452098
- Abstract: Sarcasm in social media, often expressed through text-image combinations, poses challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been demonstrated to overly rely on spurious cues within the textual modality, revealing a limited ability to genuinely identify sarcasm through nuanced text-image interactions. To solve this problem, we propose InterCLIP-MEP, which introduces Interactive CLIP (InterCLIP) with an efficient training strategy to extract enriched text-image representations by embedding cross-modal information directly into each encoder. Additionally, we design a Memory-Enhanced Predictor (MEP) with a dynamic dual-channel memory that stores valuable test sample knowledge during inference, acting as a non-parametric classifier for robust sarcasm recognition. Experiments on two benchmarks demonstrate that InterCLIP-MEP achieves state-of-the-art performance, with significant accuracy and F1 score improvements on MMSD and MMSD2.0. Our code is available at https://github.com/CoderChen01/InterCLIP-MEP.
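The abstract describes the Memory-Enhanced Predictor only at a high level, so the following is a minimal sketch of one way a dual-channel memory could act as a non-parametric classifier at inference time. It is not the authors' implementation. The class name `DualChannelMemory`, the per-channel `capacity`, the entropy-based replacement rule, and the mean cosine-similarity vote are all assumptions made for illustration; the repository linked in the abstract is the place to look for the actual method.

```python
import math
from dataclasses import dataclass, field


def entropy(probs):
    """Shannon entropy of a probability vector (lower means more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class DualChannelMemory:
    """Hypothetical memory with one channel per class (0 = non-sarcastic, 1 = sarcastic)."""
    capacity: int = 4  # assumed per-channel size, not taken from the paper
    channels: dict = field(default_factory=lambda: {0: [], 1: []})

    def update(self, feature, probs):
        """Store a test-time feature under its pseudo-label, preferring low-entropy entries."""
        label = max(range(len(probs)), key=probs.__getitem__)
        h = entropy(probs)
        slot = self.channels[label]
        if len(slot) < self.capacity:
            slot.append((h, feature))
            return
        # Channel is full: replace the least confident entry if the new one is better.
        worst = max(range(len(slot)), key=lambda i: slot[i][0])
        if h < slot[worst][0]:
            slot[worst] = (h, feature)

    def predict(self, feature, fallback_probs):
        """Return the channel whose stored features are most similar on average."""
        scores = {}
        for label, slot in self.channels.items():
            if not slot:
                # An empty channel cannot vote, so fall back to the classifier output.
                return max(range(len(fallback_probs)), key=fallback_probs.__getitem__)
            scores[label] = sum(cosine(feature, f) for _, f in slot) / len(slot)
        return max(scores, key=scores.get)


memory = DualChannelMemory(capacity=2)
memory.update([1.0, 0.0], [0.9, 0.1])  # confident non-sarcastic pseudo-label
memory.update([0.0, 1.0], [0.2, 0.8])  # confident sarcastic pseudo-label
print(memory.predict([0.1, 0.9], [0.5, 0.5]))
```

In this sketch the memory is filled from test samples themselves, which mirrors the abstract's "stores valuable test sample knowledge during inference"; the classifier probabilities are used only to pick a pseudo-label and to rank entries by confidence.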
Related papers
- RCLMuFN: Relational Context Learning and Multiplex Fusion Network for Multimodal Sarcasm Detection [1.023096557577223]
We propose a relational context learning and multiplex fusion network (RCLMuFN) for multimodal sarcasm detection.
Firstly, we employ four feature extractors to comprehensively extract features from raw text and images.
Secondly, we utilize the relational context learning module to learn the contextual information of text and images.
arXiv Detail & Related papers (2024-12-17T15:29:31Z)
- AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation [11.568176591294746]
We present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation).
This approach utilizes the Multimodal Sarcasm Detection dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy.
The second phase involves the refinement of a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations.
arXiv Detail & Related papers (2024-12-13T12:42:51Z)
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis [2.012311338995539]
This paper presents a novel framework that leverages the multi-modal contextual information from utterances and applies metaheuristic algorithms to learn for utterance-level sentiment and emotion prediction.
To show the effectiveness of our approach, we have conducted extensive evaluations on three prominent multimodal benchmark datasets.
arXiv Detail & Related papers (2024-10-02T10:07:48Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- SCMM: Calibrating Cross-modal Representations for Text-Based Person Search [43.17325362167387]
Text-Based Person Search (TBPS) is a crucial task in the Internet of Things (IoT) domain.
For cross-modal TBPS tasks, it is critical to obtain well-distributed representation in the common space.
We present Sew embedding and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings.
arXiv Detail & Related papers (2023-04-05T07:50:16Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods, and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR).
We study the interaction paradigm in depth, where we find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model MEmoBERT for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
- Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection [76.62550719834722]
We deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs.
We propose a novel multimodal learning system, MuLOT, which utilizes self-attention to exploit intra-modal correspondence.
We test our approach for multimodal sarcasm and humor detection on three benchmark datasets.
arXiv Detail & Related papers (2021-10-21T07:51:56Z)
- Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.