ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning
- URL: http://arxiv.org/abs/2507.09482v1
- Date: Sun, 13 Jul 2025 04:03:05 GMT
- Title: ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning
- Authors: Changli Wang, Rui Wu, Fang Yin
- Abstract summary: We introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation.
- Score: 4.440035845914307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. Our dataset and code will be released at https://github.com/wclapply/ViSP.
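The two training signals named in the abstract, a PPO objective steered by reward scores and a contrastive preference for higher-reward outputs, can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: the reward values (standing in for DIP's sarcasm rewards), the clipping constant `eps`, and the contrastive `margin` are all hypothetical placeholders.

```python
import math

def ppo_clipped_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """PPO clipped surrogate loss for one sampled text.

    ratio = pi_new / pi_old; the loss is -min(ratio * A, clip(ratio) * A),
    so updates that move the policy too far from the old one are clipped.
    """
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return -min(ratio * advantage, clipped * advantage)

def contrastive_margin_loss(score_high, score_low, margin=0.1):
    """Hinge-style contrastive loss: push the higher-reward candidate's
    model score above the lower-reward one by at least `margin`."""
    return max(0.0, margin - (score_high - score_low))

# Toy usage: a candidate whose new log-prob rose under a positive
# advantage is clipped; a mis-ranked candidate pair incurs a penalty.
ppo_loss = ppo_clipped_objective(log_prob_new=0.5, log_prob_old=0.0, advantage=1.0)
rank_loss = contrastive_margin_loss(score_high=0.7, score_low=0.9)
```

In the full framework the advantage would come from a reward model (DIP in the paper) scoring sarcastic intent, and the contrastive pair from sampled generations ranked by that same reward.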
Related papers
- On the Impact of Language Nuances on Sentiment Analysis with Large Language Models: Paraphrasing, Sarcasm, and Emojis [0.3774866290142281]
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, including sentiment analysis. This research explores how textual nuances, including emojis and sarcasm, affect sentiment analysis.
arXiv Detail & Related papers (2025-04-08T01:29:58Z)
- Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection [68.82684696740134]
We benchmark the MUStARD dataset with state-of-the-art language, speech, and visual encoders, for fully utilizing the totality of the multi-modal richness that it has to offer.
We propose an extension, which we call MUStARD++ Balanced, benchmarking the same with instances from the extension split across both train and test sets, achieving a further 2.4% macro-F1 boost.
arXiv Detail & Related papers (2023-09-29T07:00:41Z)
- MMSD2.0: Towards a Reliable Multi-modal Sarcasm Detection System [57.650338588086186]
We introduce MMSD2.0, a correction dataset that fixes the shortcomings of MMSD.
We present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives.
arXiv Detail & Related papers (2023-07-14T03:22:51Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- HIT-SCIR at MMNLU-22: Consistency Regularization for Multilingual Spoken Language Understanding [56.756090143062536]
We propose to use consistency regularization based on a hybrid data augmentation strategy.
We conduct experiments on the MASSIVE dataset under both full-dataset and zero-shot settings.
Our proposed method improves the performance on both intent detection and slot filling tasks.
arXiv Detail & Related papers (2023-01-05T11:21:15Z)
- Sarcasm Detection Framework Using Emotion and Sentiment Features [62.997667081978825]
We propose a model which incorporates emotion and sentiment features to capture the incongruity intrinsic to sarcasm.
Our approach achieved state-of-the-art results on four datasets from social networking platforms and online media.
arXiv Detail & Related papers (2022-11-23T15:14:44Z)
- How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation [62.89586083449108]
We study a new problem of cross-modal sarcasm generation (CMSG), i.e., generating a sarcastic description for a given image.
CMSG is challenging as models need to satisfy the characteristics of sarcasm, as well as the correlation between different modalities.
We propose an Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal sarcasm generation.
arXiv Detail & Related papers (2022-11-20T14:38:24Z)
- Sarcasm Detection in Twitter -- Performance Impact when using Data Augmentation: Word Embeddings [0.0]
Sarcasm is the use of words to mock or annoy someone, or for humorous effect.
We propose a contextual model for sarcasm identification on Twitter using RoBERTa and data augmentation.
We achieve a 3.2% performance gain on the iSarcasm dataset when using data augmentation to increase the amount of data labeled as sarcastic by 20%.
arXiv Detail & Related papers (2021-08-23T04:24:12Z)
- Parallel Deep Learning-Driven Sarcasm Detection from Pop Culture Text and English Humor Literature [0.76146285961466]
We manually extract the sarcastic word distribution features of a benchmark pop culture sarcasm corpus.
We generate input sequences formed of the weighted vectors from such words.
Our proposed model for detecting sarcasm reaches a peak training accuracy of 98.95% when trained on the discussed dataset.
arXiv Detail & Related papers (2021-06-10T14:01:07Z)
- Sarcasm Detection using Context Separators in Online Discourse [3.655021726150369]
Sarcasm is an intricate form of speech, where meaning is conveyed implicitly.
In this work, we use RoBERTa_large to detect sarcasm in two datasets.
We also assert the importance of context in improving the performance of contextual word embedding models.
arXiv Detail & Related papers (2020-06-01T10:52:35Z)
- $R^3$: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge [51.70688120849654]
We propose an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence.
Our method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm.
arXiv Detail & Related papers (2020-04-28T02:30:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.