SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding
- URL: http://arxiv.org/abs/2408.11319v2
- Date: Sat, 24 Aug 2024 03:58:40 GMT
- Title: SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding
- Authors: Yazhou Zhang, Chunwang Zou, Zheng Lian, Prayag Tiwari, Jing Qin
- Abstract summary: We present evaluations on six widely used benchmark datasets through different prompting approaches.
GPT-4 consistently and significantly outperforms other LLMs across various prompting methods.
The few-shot IO prompting method outperforms the other two methods: zero-shot IO and few-shot CoT.
- Score: 19.412462224847086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of large language models (LLMs), the tasks of "System 1" (the fast, unconscious, and intuitive tasks, e.g., sentiment analysis, text classification, etc.) have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices such as hyperbole and figurative language to convey true sentiments and intentions, and thus involves a higher level of abstraction than sentiment analysis. There is growing concern that the argument about LLMs' success may not be fully tenable when sarcasm understanding is considered. To address this question, we select eleven SOTA LLMs and eight SOTA pre-trained language models (PLMs) and present comprehensive evaluations on six widely used benchmark datasets through different prompting approaches, i.e., zero-shot input/output (IO) prompting, few-shot IO prompting, and few-shot chain-of-thought (CoT) prompting. Our results highlight three key findings: (1) current LLMs underperform supervised PLM-based sarcasm detection baselines across all six benchmarks, suggesting that significant effort is still required to improve LLMs' understanding of human sarcasm. (2) GPT-4 consistently and significantly outperforms other LLMs across the various prompting methods, with an average improvement of 14.0%. Claude 3 and ChatGPT demonstrate the next best performance after GPT-4. (3) Few-shot IO prompting outperforms the other two methods, zero-shot IO and few-shot CoT. The likely reason is that sarcasm detection is a holistic, intuitive, and non-rational cognitive process that does not adhere to step-by-step logical reasoning, making CoT less effective for sarcasm than it is for mathematical reasoning tasks.
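To make the three prompting regimes concrete, the following is a minimal Python sketch; the prompt wording, label strings, and the `call_llm` stub are illustrative assumptions, not the paper's exact templates.

```python
# Sketch of the three prompting regimes compared in the paper.
# `call_llm` is a placeholder; swap in a real chat-completion client.
def call_llm(prompt: str) -> str:
    return "sarcastic"  # stub response for illustration

def zero_shot_io(text: str) -> str:
    # Direct input/output prompting: no demonstrations, no rationale.
    prompt = (f'Is the following text sarcastic? Answer "sarcastic" or '
              f'"not sarcastic".\nText: {text}\nAnswer:')
    return call_llm(prompt)

def few_shot_io(text: str, demos: list[tuple[str, str]]) -> str:
    # Prepend labeled demonstrations before the query (input/output only).
    shots = "\n".join(f"Text: {t}\nAnswer: {y}" for t, y in demos)
    return call_llm(f"{shots}\nText: {text}\nAnswer:")

def few_shot_cot(text: str, demos: list[tuple[str, str, str]]) -> str:
    # Demonstrations include an intermediate rationale before the label.
    shots = "\n".join(f"Text: {t}\nReasoning: {r}\nAnswer: {y}" for t, r, y in demos)
    return call_llm(f"{shots}\nText: {text}\nReasoning:")
```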
Related papers
- Detecting Emotional Incongruity of Sarcasm by Commonsense Reasoning [32.5690489394632]
This paper focuses on sarcasm detection, which aims to identify whether given statements convey criticism, mockery, or other negative sentiment opposite to the literal meaning.
Existing methods lack commonsense inferential ability when they face complex real-world scenarios, leading to unsatisfactory performance.
We propose EICR, a novel framework for sarcasm detection that conducts incongruity reasoning based on commonsense augmentation.
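As a rough illustration of commonsense-augmented incongruity reasoning (not EICR's actual pipeline, which the summary does not fully specify), one can prepend retrieved commonsense inferences to the input before classification:

```python
# Illustrative only: prepend commonsense inferences to the input, then judge
# literal-vs-implied incongruity. `retrieve_commonsense` and `call_llm` are
# hypothetical stand-ins, not EICR's actual components.
def call_llm(prompt: str) -> str:
    return "sarcastic"  # stub; swap in a real model client

def retrieve_commonsense(text: str) -> list[str]:
    # e.g., query a commonsense model or KB for likely intents and reactions
    return ["the speaker likely expects the opposite outcome"]

def detect_with_commonsense(text: str) -> str:
    facts = "; ".join(retrieve_commonsense(text))
    prompt = (f"Text: {text}\nCommonsense: {facts}\n"
              "Does the literal sentiment conflict with the implied one? "
              'Answer "sarcastic" or "not sarcastic":')
    return call_llm(prompt)
```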
arXiv Detail & Related papers (2024-12-17T11:25:55Z)
- Pragmatic Metacognitive Prompting Improves LLM Performance on Sarcasm Detection [40.493832682762246]
We introduce Pragmatic Metacognitive Prompting (PMP) to improve the performance of Large Language Models (LLMs) in sarcasm detection.
Evaluated with state-of-the-art LLMs such as LLaMA-3-8B, GPT-4o, and Claude 3.5 Sonnet, PMP achieves state-of-the-art performance when used with GPT-4o.
This study demonstrates that integrating pragmatic reasoning and metacognitive strategies into prompting significantly enhances LLMs' ability to detect sarcasm.
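A hedged sketch of the two-pass idea (a pragmatic interpretation first, then a metacognitive review of that interpretation); the exact PMP prompts differ.

```python
# Sketch of the two-pass idea behind PMP: elicit a pragmatic interpretation
# first, then have the model review that interpretation before labeling.
# Prompts are illustrative; `call_llm` is a placeholder client.
def call_llm(prompt: str) -> str:
    return "sarcastic"  # stub

def pmp_detect(text: str) -> str:
    interpretation = call_llm(
        f"Text: {text}\nDescribe the speaker's likely intent, tone, and any "
        "gap between the literal and the implied meaning."
    )
    return call_llm(  # metacognitive pass: reflect on the first-pass analysis
        f"Text: {text}\nYour analysis: {interpretation}\n"
        "Reflect on whether this analysis truly supports sarcasm, then answer "
        '"sarcastic" or "not sarcastic":'
    )
```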
arXiv Detail & Related papers (2024-12-04T07:16:30Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.
GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
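A minimal sketch of the observe/reflect/speak loop, assuming a toy keyword retriever and a placeholder `call_llm`:

```python
# Sketch of GIVE's observe -> reflect -> speak loop over external expert data.
# The keyword retriever, prompts, and `call_llm` stub are illustrative
# assumptions, not the paper's actual components.
def call_llm(prompt: str) -> str:
    return "..."  # stub; swap in a real model client

def give_answer(query: str, knowledge_base: list[str]) -> str:
    # Observe: select the expert facts most pertinent to the query.
    words = query.lower().split()
    relevant = [k for k in knowledge_base if any(w in k.lower() for w in words)]
    # Reflect: query-specific divergent thinking over the selected facts.
    thoughts = call_llm(f"Query: {query}\nFacts: {relevant}\nList plausible inferences:")
    # Speak: synthesize facts and inferences into the final output.
    return call_llm(f"Query: {query}\nFacts: {relevant}\nInferences: {thoughts}\nAnswer:")
```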
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models? [13.222198659253056]
We introduce a new prompting framework (called SarcasmCue) containing four sub-methods.
It prompts large language models (LLMs) to detect human sarcasm by considering both sequential and non-sequential prompting methods.
Our framework consistently improves over the previous state of the art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 scores across four datasets.
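The summary does not name the four sub-methods, so the sketch below only contrasts a sequential (step-wise) prompt with a non-sequential (parallel cue-aggregation) prompt; all prompt text is illustrative.

```python
# Illustrative contrast between a sequential (step-wise) prompt and a
# non-sequential (parallel cue-aggregation) prompt; these are not the
# paper's actual four sub-methods, which the summary does not name.
def call_llm(prompt: str) -> str:
    return "sarcastic"  # stub

def sequential_prompt(text: str) -> str:
    # Reason through ordered steps toward a label.
    return call_llm(f"Text: {text}\nStep 1: literal meaning? Step 2: implied "
                    "meaning? Step 3: do they conflict? Final answer:")

def non_sequential_prompt(text: str) -> str:
    # Gather independent cues in parallel, then aggregate them once.
    cues = [call_llm(f"Text: {text}\nDescribe the {c} cue:")
            for c in ("linguistic", "contextual", "emotional")]
    return call_llm(f"Text: {text}\nCues: {cues}\nAggregate the cues and answer:")
```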
arXiv Detail & Related papers (2024-07-17T16:42:03Z)
- RVISA: Reasoning and Verification for Implicit Sentiment Analysis [18.836998294161834]
Implicit sentiment analysis (ISA) poses a significant challenge due to the absence of salient cue words in expressions.
This study proposes RVISA, a two-stage reasoning framework that harnesses the generation ability of decoder-only (DO) LLMs and the reasoning ability of encoder-decoder (ED) LLMs to train an enhanced reasoner.
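A hedged sketch of the two-stage idea, with hypothetical component names: a decoder-only model drafts rationales, a verification step filters them, and the surviving (text, rationale, label) triples would supervise the enhanced reasoner.

```python
# Hedged sketch of a two-stage setup in the spirit of RVISA: a decoder-only
# model drafts rationales, a verification step filters them, and the surviving
# triples would supervise the encoder-decoder reasoner. All names here are
# hypothetical, not the paper's actual components.
def do_llm_generate_rationale(text: str) -> str:
    return "the praise is exaggerated given the situation"  # stub DO LLM

def verify(rationale: str, label: str) -> bool:
    return bool(rationale)  # stub; RVISA applies stricter answer verification

training_data = []
for text, label in [("What a great day to be stuck in traffic.", "negative")]:
    rationale = do_llm_generate_rationale(text)
    if verify(rationale, label):
        training_data.append((text, rationale, label))
# `training_data` would then fine-tune the ED reasoner on verified rationales.
```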
arXiv Detail & Related papers (2024-07-02T15:07:54Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for prompting large language models.
It achieves consistent and correct step-by-step prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
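The division of labor can be sketched as follows: the LLM only translates natural language into first-order logic, and a symbolic prover decides entailment. The lookup table below stands in for the LLM call, and `prove` is a stub for the external theorem prover.

```python
# Sketch of LINC's division of labor: the LLM only translates natural language
# into first-order logic; a symbolic prover decides entailment. The lookup
# table stands in for the LLM call, and `prove` is a stub for the external
# theorem prover.
def llm_translate_to_fol(sentence: str) -> str:
    table = {
        "All birds fly.": "all x. (Bird(x) -> Flies(x))",
        "Tweety is a bird.": "Bird(Tweety)",
        "Tweety flies.": "Flies(Tweety)",
    }
    return table[sentence]

def prove(premises: list[str], conclusion: str) -> bool:
    return True  # stub; replace with a call to an actual first-order prover

premises = [llm_translate_to_fol(s) for s in ("All birds fly.", "Tweety is a bird.")]
print(prove(premises, llm_translate_to_fol("Tweety flies.")))  # -> True
```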
arXiv Detail & Related papers (2023-10-23T17:58:40Z)
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
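A minimal sketch of the three components, assuming a chat model that can verbalize a confidence score; the prompt text and `call_llm` stub are illustrative.

```python
import collections

# Sketch of the black-box framework: a prompt that elicits verbalized
# confidence, repeated sampling, and consistency-based aggregation.
# The prompt text and `call_llm` stub are illustrative assumptions.
def call_llm(prompt: str) -> tuple[str, float]:
    return "42", 0.8  # stub: (answer, verbalized confidence in [0, 1])

def estimate_confidence(question: str, k: int = 5) -> tuple[str, float]:
    prompt = f"{question}\nAnswer, then state your confidence from 0 to 1."
    samples = [call_llm(prompt) for _ in range(k)]           # sampling
    counts = collections.Counter(answer for answer, _ in samples)
    top, n = counts.most_common(1)[0]
    consistency = n / k                                      # agreement rate
    verbalized = sum(c for a, c in samples if a == top) / n  # mean stated conf
    return top, (consistency + verbalized) / 2               # simple blend
```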
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
- PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts [76.18347405302728]
This study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic.
The adversarial prompts are then employed in diverse tasks including sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving.
Our findings demonstrate that contemporary Large Language Models are not robust to adversarial prompts.
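As a rough illustration of the character- and word-level ends of this spectrum (the benchmark itself uses established attack methods, not these toy edits):

```python
import random

# Toy character- and word-level prompt perturbations of the kind the benchmark
# studies; PromptRobust itself uses established attack methods, not these.
def char_attack(prompt: str, rng: random.Random) -> str:
    i = rng.randrange(len(prompt))
    return prompt[:i] + prompt[i + 1:]  # delete one character

def word_attack(prompt: str, rng: random.Random) -> str:
    words = prompt.split()
    i = rng.randrange(len(words))
    words[i] = words[i] + "ly"  # crude distracting suffix
    return " ".join(words)

rng = random.Random(0)
base = "Classify the sentiment of the following review."
print(char_attack(base, rng))
print(word_attack(base, rng))
```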
arXiv Detail & Related papers (2023-06-07T15:37:00Z)
- Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples.
We show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer.
Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms standard zero-shot LLM performance.
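The two-stage procedure can be sketched as follows; `call_llm` is a placeholder for a real completion client.

```python
# Sketch of Zero-shot-CoT's two-stage prompting: first elicit reasoning with
# the trigger phrase, then extract the final answer from that reasoning.
# `call_llm` is a placeholder for a real completion client.
def call_llm(prompt: str) -> str:
    return "There are 3 pairs, so 3 * 2 = 6 balls. The answer is 6."  # stub

def zero_shot_cot(question: str) -> str:
    reasoning = call_llm(f"Q: {question}\nA: Let's think step by step.")
    return call_llm(f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
                    "Therefore, the answer is")
```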
arXiv Detail & Related papers (2022-05-24T09:22:26Z)