Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis
- URL: http://arxiv.org/abs/2510.07096v1
- Date: Wed, 08 Oct 2025 14:53:48 GMT
- Title: Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis
- Authors: Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler
- Abstract summary: Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis. We propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis.
- Score: 19.632399543819382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.
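The dual conditioning described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the retrieval metric or how the two signals are fused, so cosine similarity, mean pooling, and concatenation are assumptions here, and every name (`retrieve_exemplars`, `build_condition`, the `bank` entries) is hypothetical. For brevity, the same vector serves as both the retrieval query and the semantic embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_exemplars(query, bank, k=2):
    """Return the k bank entries whose text embedding is closest to the query.

    Assumption: similarity is cosine over text embeddings.
    """
    ranked = sorted(bank, key=lambda e: cosine(query, e["text_emb"]), reverse=True)
    return ranked[:k]

def build_condition(semantic_emb, exemplars):
    """Concatenate the semantic embedding with the mean exemplar prosody vector.

    Assumption: fusion by concatenation; a real system could fuse differently.
    """
    dim = len(exemplars[0]["prosody"])
    mean_prosody = [
        sum(e["prosody"][i] for e in exemplars) / len(exemplars) for i in range(dim)
    ]
    return list(semantic_emb) + mean_prosody

# Hypothetical exemplar bank: text embeddings paired with prosody features.
bank = [
    {"id": "a", "text_emb": [1.0, 0.0], "prosody": [0.2, 0.8]},
    {"id": "b", "text_emb": [0.0, 1.0], "prosody": [0.9, 0.1]},
    {"id": "c", "text_emb": [0.9, 0.1], "prosody": [0.4, 0.6]},
]
query = [1.0, 0.05]  # stand-in for an LLM-derived semantic embedding
top = retrieve_exemplars(query, bank, k=2)
cond = build_condition(query, top)
```

In a full system, a vector like `cond` would condition the synthesis backbone (VITS in the paper); the sketch stops at building it.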
Related papers
- World model inspired sarcasm reasoning with large language model agents [0.0]
Sarcasm understanding is a challenging problem in natural language processing. Most existing approaches still rely on black-box predictions of a single model. We propose World Model inspired SArcasm Reasoning (WM-SAR).
arXiv Detail & Related papers (2025-12-30T16:31:08Z)
- MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection [16.725936163763684]
VideoLMs struggle with complex tasks like sarcasm detection. MUStReason is a diagnostic benchmark enriched with annotations of modality-specific relevant cues. We propose PragCoT, a framework that steers VideoLMs to focus on implied intentions over literal meaning.
arXiv Detail & Related papers (2025-10-27T18:03:11Z)
- Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis [14.798970809585066]
Sarcastic speech synthesis is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. This study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process.
arXiv Detail & Related papers (2025-08-18T15:44:54Z)
- Detecting Emotional Incongruity of Sarcasm by Commonsense Reasoning [32.5690489394632]
This paper focuses on sarcasm detection, which aims to identify whether given statements convey criticism, mockery, or other negative sentiment opposite to the literal meaning. Existing methods lack commonsense inferential ability when they face complex real-world scenarios, leading to unsatisfactory performance. We propose a novel framework for sarcasm detection, which conducts incongruity reasoning based on commonsense augmentation, called EICR.
arXiv Detail & Related papers (2024-12-17T11:25:55Z)
- Expressivity and Speech Synthesis [51.75420054449122]
We outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology.
arXiv Detail & Related papers (2024-04-30T08:47:24Z)
- Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue [63.32199372362483]
We propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, where an utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip.
arXiv Detail & Related papers (2024-02-06T03:14:46Z)
- DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations.
We introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z)
- Sarcasm Detection Framework Using Emotion and Sentiment Features [62.997667081978825]
We propose a model which incorporates emotion and sentiment features to capture the incongruity intrinsic to sarcasm.
Our approach achieved state-of-the-art results on four datasets from social networking platforms and online media.
arXiv Detail & Related papers (2022-11-23T15:14:44Z)
- How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation [62.89586083449108]
We study a new problem of cross-modal sarcasm generation (CMSG), i.e., generating a sarcastic description for a given image.
CMSG is challenging as models need to satisfy the characteristics of sarcasm, as well as the correlation between different modalities.
We propose an Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal sarcasm generation.
arXiv Detail & Related papers (2022-11-20T14:38:24Z)
- Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement [31.97249246223621]
Sarcasm is a linguistic phenomenon indicating a discrepancy between literal meanings and implied intentions.
Most existing techniques only modeled the atomic-level inconsistencies between the text input and its accompanying image.
We propose a novel hierarchical framework for sarcasm detection by exploring both the atomic-level congruity based on multi-head cross attention mechanism and the composition-level congruity based on graph neural networks.
arXiv Detail & Related papers (2022-10-07T12:44:33Z)
- Multi-Modal Sarcasm Detection Based on Contrastive Attention Mechanism [7.194040730138362]
We construct a Contrastive-Attention-based Sarcasm Detection (ConAttSD) model, which uses an inter-modality contrastive attention mechanism to extract contrastive features for an utterance.
Our experiments on MUStARD, a benchmark multi-modal sarcasm dataset, demonstrate the effectiveness of the proposed ConAttSD model.
arXiv Detail & Related papers (2021-09-30T14:17:51Z)
- $R^3$: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge [51.70688120849654]
We propose an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence.
Our method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm.
arXiv Detail & Related papers (2020-04-28T02:30:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.