How to Describe Images in a More Funny Way? Towards a Modular Approach
to Cross-Modal Sarcasm Generation
- URL: http://arxiv.org/abs/2211.10992v1
- Date: Sun, 20 Nov 2022 14:38:24 GMT
- Authors: Jie Ruan, Yue Wu, Xiaojun Wan, Yuesheng Zhu
- Abstract summary: We study a new problem of cross-modal sarcasm generation (CMSG), i.e., generating a sarcastic description for a given image.
CMSG is challenging as models need to satisfy the characteristics of sarcasm, as well as the correlation between different modalities.
We propose an Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal sarcasm generation.
- Score: 62.89586083449108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sarcasm generation has been investigated in previous studies by considering
it as a text-to-text generation problem, i.e., generating a sarcastic sentence
for an input sentence. In this paper, we study a new problem of cross-modal
sarcasm generation (CMSG), i.e., generating a sarcastic description for a given
image. CMSG is challenging as models need to satisfy the characteristics of
sarcasm, as well as the correlation between different modalities. In addition,
there should be some inconsistency between the two modalities, which requires
imagination. Moreover, high-quality paired training data is scarce. To address
these problems, we take a step toward generating sarcastic descriptions from
images without paired training data and propose an
Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal
sarcasm generation. Specifically, EGRM first extracts diverse information from
an image at different levels and uses the obtained image tags, sentimental
descriptive caption, and commonsense-based consequence to generate candidate
sarcastic texts. Then, a comprehensive ranking algorithm, which considers
image-text relation, sarcasticness, and grammaticality, is proposed to select a
final text from the candidate texts. Human evaluation on five criteria over a
total of 1200 generated image-text pairs from eight systems, together with
auxiliary automatic evaluation, demonstrates the superiority of our method.
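The ranking stage described in the abstract can be sketched as a weighted scoring over candidate texts. The criterion names follow the abstract (image-text relation, sarcasticness, grammaticality), but the weights, score ranges, and scoring function below are illustrative assumptions, not the paper's actual ranking algorithm:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    image_text_relation: float  # assumed in [0, 1], e.g. an image-text similarity score
    sarcasticness: float        # assumed in [0, 1], e.g. from a sarcasm classifier
    grammaticality: float       # assumed in [0, 1], e.g. a language-model fluency score

def rank_candidates(candidates, weights=(0.4, 0.4, 0.2)):
    """Sort candidates by a weighted sum of the three criteria (weights are illustrative)."""
    w_rel, w_sarc, w_gram = weights
    def score(c):
        return (w_rel * c.image_text_relation
                + w_sarc * c.sarcasticness
                + w_gram * c.grammaticality)
    return sorted(candidates, key=score, reverse=True)

# Toy usage: for a rainy-day photo, the incongruous caption should rank first.
cands = [
    Candidate("Great weather for a picnic.", 0.9, 0.1, 0.95),
    Candidate("Lovely day to get soaked, really.", 0.8, 0.9, 0.9),
]
best = rank_candidates(cands)[0]
print(best.text)  # → Lovely day to get soaked, really.
```

Under these weights the second candidate scores 0.86 versus 0.59 for the first, so the sarcastic caption is selected as the final text.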
Related papers
- A Survey of Multimodal Sarcasm Detection [32.659528422756416]
Sarcasm is a rhetorical device that is used to convey the opposite of the literal meaning of an utterance.
We present the first comprehensive survey on multimodal sarcasm detection to date.
arXiv Detail & Related papers (2024-10-24T16:17:47Z)
- Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection [12.744170917349287]
This study presents a novel framework for multimodal sarcasm detection that can process input triplets.
The proposed model achieves the best accuracies of 92.89% and 64.48% on the Twitter multimodal sarcasm and MultiBully datasets, respectively.
arXiv Detail & Related papers (2024-08-05T16:07:31Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Researchers eye-view of sarcasm detection in social media textual content [0.0]
The enormous use of sarcastic text in all forms of social media communication can have a physiological effect on target users.
This paper discusses various sarcasm detection techniques and concludes with promising approaches and related datasets with optimal features.
arXiv Detail & Related papers (2023-04-17T19:45:10Z)
- Polarity based Sarcasm Detection using Semigraph [0.0]
This article presents a novel semigraph-based method, covering both semigraph construction and the sarcasm detection process.
A variation of the semigraph based on the pattern-relatedness of the text document is suggested.
The proposed method obtains the sarcastic and non-sarcastic polarity scores of a document using a semigraph.
arXiv Detail & Related papers (2023-04-04T00:13:55Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
- Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement [31.97249246223621]
Sarcasm is a linguistic phenomenon indicating a discrepancy between literal meanings and implied intentions.
Most existing techniques only model the atomic-level inconsistencies between the text input and its accompanying image.
We propose a novel hierarchical framework for sarcasm detection by exploring both the atomic-level congruity based on a multi-head cross-attention mechanism and the composition-level congruity based on graph neural networks.
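The atomic-level congruity idea above can be illustrated with a minimal single-head sketch: text tokens attend over image patches via scaled dot-product cross attention, and congruity is scored as the similarity between each token and its attended image context. The function name, dimensions, and cosine-based scoring are illustrative assumptions; the paper itself uses multi-head attention plus graph neural networks:

```python
import numpy as np

def cross_attention_congruity(text_tokens, image_patches):
    """Score text-image congruity with single-head cross attention (a sketch).

    text_tokens:   (n_tokens, d) array of token embeddings (queries)
    image_patches: (n_patches, d) array of patch embeddings (keys/values)
    Returns the mean cosine similarity between each token and its
    attention-weighted image context, in [-1, 1].
    """
    d_k = text_tokens.shape[-1]
    # Scaled dot-product attention: text as queries, patches as keys/values.
    scores = text_tokens @ image_patches.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attended = weights @ image_patches  # (n_tokens, d) attended image context
    # Cosine similarity between each token and its attended context.
    num = (text_tokens * attended).sum(axis=-1)
    den = np.linalg.norm(text_tokens, axis=-1) * np.linalg.norm(attended, axis=-1)
    return float((num / den).mean())

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))
img = rng.normal(size=(10, 8))
print(cross_attention_congruity(text, img))
```

A low congruity score under such a measure would signal the text-image incongruity that sarcasm detectors look for.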
arXiv Detail & Related papers (2022-10-07T12:44:33Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- $R^3$: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge [51.70688120849654]
We propose an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence.
Our method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm.
arXiv Detail & Related papers (2020-04-28T02:30:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.