Large Language Models Often Say One Thing and Do Another
- URL: http://arxiv.org/abs/2503.07003v1
- Date: Mon, 10 Mar 2025 07:34:54 GMT
- Title: Large Language Models Often Say One Thing and Do Another
- Authors: Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixiang Zhou, Le Sun, Yingfei Sun
- Abstract summary: We develop a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains.
- Score: 49.22262396351797
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As large language models (LLMs) increasingly become central to various applications and interact with diverse user populations, ensuring their reliable and consistent performance is becoming more important. This paper explores a critical issue in assessing the reliability of LLMs: the consistency between their words and deeds. To quantitatively explore this consistency, we developed a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains, including opinion vs. action, non-ethical value vs. action, ethical value vs. action, and theory vs. application. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains. Subsequently, we conducted experiments with either word alignment or deed alignment to observe their impact on the other aspect. The experimental results indicate that alignment only on words or deeds poorly and unpredictably influences the other aspect. This supports our hypothesis that the underlying knowledge guiding LLMs' word or deed choices is not contained within a unified space.
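To make the benchmark idea concrete, here is a minimal Python sketch of how a word-deed consistency score over paired questions might be computed. The `WDPair` fields, the normalized answer labels, and the exact-match criterion are illustrative assumptions, not the WDCT protocol itself.

```python
from dataclasses import dataclass

@dataclass
class WDPair:
    """One hypothetical benchmark item: a word-based question and a
    deed-based question probing the same underlying stance."""
    word_question: str  # e.g., "Do you value honesty over comfort?"
    deed_question: str  # e.g., a scenario forcing a concrete choice
    word_answer: str    # model's stated position, as a normalized label
    deed_answer: str    # model's chosen action, as a normalized label

def consistency_score(pairs: list[WDPair]) -> float:
    """Fraction of pairs where the stated word matches the chosen deed."""
    if not pairs:
        return 0.0
    agree = sum(p.word_answer == p.deed_answer for p in pairs)
    return agree / len(pairs)

pairs = [
    WDPair("Is honesty more important than comfort?",
           "Your friend asks whether their flawed plan is good. You say:",
           word_answer="honesty", deed_answer="comfort"),
]
print(f"word-deed consistency: {consistency_score(pairs):.2f}")  # 0.00
```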
Related papers
- ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning.
We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics.
Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
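A minimal sketch of what a perplexity-based probe for causal versus temporal connectives could look like, assuming a Hugging Face causal LM (the model name and templates are placeholders, and this is not ExpliCa's exact protocol):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

# Lower perplexity suggests the model prefers that verbalization; note that
# the order of the events itself can sway the result, as the paper observes.
cause, effect = "the glass fell", "the glass broke"
for connective in ("because", "so", "and then"):
    text = f"{cause}, {connective} {effect}."
    print(f"{connective:>9}: ppl = {perplexity(text):.1f}")
```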
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
- Internal Consistency and Self-Feedback in Large Language Models: A Survey [19.647988281648253]
We use a unified perspective of internal consistency, offering explanations for reasoning deficiencies and hallucinations.
We introduce Self-Feedback, an effective theoretical framework for mining internal consistency.
arXiv Detail & Related papers (2024-07-19T17:59:03Z)
- Experimental Pragmatics with Machines: Testing LLM Predictions for the Inferences of Plain and Embedded Disjunctions [4.753535328327316]
We focus on three inferences of plain and embedded disjunctions, and compare them with regular scalar implicatures.
We investigate this comparison from the novel perspective of the predictions of state-of-the-art large language models.
The results of our best-performing models mostly align with those of humans, both in the large differences we find between those inferences and implicatures and in fine-grained distinctions among different aspects of those inferences.
arXiv Detail & Related papers (2024-05-09T13:54:15Z)
- Word Importance Explains How Prompts Affect Language Model Outputs [0.7223681457195862]
This study presents a method to improve the explainability of large language models by varying individual words in prompts.
Unlike classical attention, word importance measures the impact of prompt words on arbitrarily-defined text scores.
Results show that word importance scores are closely related to the expected suffix importances for multiple scoring functions.
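A minimal leave-one-word-out sketch of prompt word importance, where importance is the drop in an arbitrary text score when a word is removed (the paper's estimator differs in detail, and the scoring function here is a toy stand-in):

```python
from typing import Callable

def word_importance(prompt: str,
                    score: Callable[[str], float]) -> list[tuple[str, float]]:
    """Importance of each prompt word: score(prompt) minus the score
    with that word removed."""
    words = prompt.split()
    base = score(prompt)
    result = []
    for i, w in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        result.append((w, base - score(ablated)))
    return result

# Toy scoring function; in practice this would be any score computed over
# the model's generated suffix (sentiment, refusal rate, etc.).
toy_score = lambda text: float(text.lower().count("please"))
print(word_importance("please answer the question politely", toy_score))
```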
arXiv Detail & Related papers (2024-03-05T15:04:18Z)
- Improving Language Models Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary [65.268245109828]
Non-human-like behaviour of contemporary pre-trained language models (PLMs) is a leading factor undermining their trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produces contradictory results.
We propose a practical approach that alleviates the inconsistent behaviour issue by improving PLMs' awareness of conceptual roles.
arXiv Detail & Related papers (2023-10-24T06:15:15Z)
- How to Handle Different Types of Out-of-Distribution Scenarios in Computational Argumentation? A Comprehensive and Fine-Grained Field Study [59.13867562744973]
This work systematically assesses LMs' capabilities for out-of-distribution (OOD) scenarios.
We find that the efficacy of such learning paradigms varies with the type of OOD shift.
Specifically, while ICL excels for domain shifts, prompt-based fine-tuning surpasses it for topic shifts.
arXiv Detail & Related papers (2023-09-15T11:15:47Z)
- Semantic Consistency for Assuring Reliability of Large Language Models [9.876355290198639]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks.
We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs.
We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
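One simple instantiation of such a metric (the paper formulates several, and this exact-match variant is only an assumption for illustration): mean pairwise agreement of a model's normalized answers across paraphrases of the same question.

```python
from itertools import combinations

def normalize(answer: str) -> str:
    return answer.strip().lower().rstrip(".")

def semantic_consistency(answers: list[str]) -> float:
    """Mean pairwise agreement; 1.0 means every paraphrase of the
    question elicited the same normalized answer."""
    norm = [normalize(a) for a in answers]
    pairs = list(combinations(range(len(norm)), 2))
    if not pairs:
        return 1.0
    return sum(norm[i] == norm[j] for i, j in pairs) / len(pairs)

# Answers elicited from one model under three paraphrased prompts:
print(semantic_consistency(["Paris", "paris.", "Lyon"]))  # ~0.33
```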
arXiv Detail & Related papers (2023-08-17T18:11:33Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis [56.84237932819403]
This paper aims to estimate and mitigate the adverse effect of the textual modality in order to achieve strong OOD generalization.
To this end, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis.
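A minimal sketch of counterfactual debiasing in this spirit, assuming the common total-effect-minus-direct-effect formulation (the paper's exact estimator may differ): subtract a text-only branch's prediction from the full multimodal prediction.

```python
import torch

def debiased_logits(multimodal: torch.Tensor,
                    text_only: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Remove the direct textual effect from the total multimodal effect."""
    return multimodal - alpha * text_only

full = torch.tensor([2.0, 0.5, -1.0])  # f(text, audio, video)
text = torch.tensor([1.5, 0.1, -0.5])  # f(text, counterfactual audio/video)
print(debiased_logits(full, text))     # tensor([ 0.5000,  0.4000, -0.5000])
```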
arXiv Detail & Related papers (2022-07-24T03:57:40Z)
- Keywords and Instances: A Hierarchical Contrastive Learning Framework Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism that unifies semantic meaning across hybrid granularities in the input text.
Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
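A minimal sketch of a two-granularity contrastive objective in this spirit (not the paper's exact architecture): an in-batch InfoNCE loss applied at both the instance level and the keyword level, then summed.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """In-batch InfoNCE: row i of `anchors` matches row i of `positives`."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

# Placeholder encoder outputs; in a real model these come from the encoder.
B, d = 8, 128
inst_a, inst_b = torch.randn(B, d), torch.randn(B, d)  # instance-level views
kw_a, kw_b = torch.randn(B, d), torch.randn(B, d)      # keyword-level views
loss = info_nce(inst_a, inst_b) + info_nce(kw_a, kw_b)
print(f"hierarchical contrastive loss: {loss.item():.3f}")
```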
arXiv Detail & Related papers (2022-05-26T13:26:03Z)
- Compass-aligned Distributional Embeddings for Studying Semantic Differences across Corpora [14.993021283916008]
We present a framework to support cross-corpora language studies with word embeddings.
CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora.
The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available.
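CADE's own mechanism freezes a shared "compass" of context vectors while training corpus-specific embeddings; as a simpler, well-known stand-in for cross-corpus alignment, this sketch rotates one embedding space onto another with orthogonal Procrustes over a shared vocabulary.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
emb_a = rng.standard_normal((1000, 100))          # embeddings from corpus A
true_rot, _ = np.linalg.qr(rng.standard_normal((100, 100)))
emb_b = emb_a @ true_rot + 0.01 * rng.standard_normal((1000, 100))  # corpus B

# Learn the rotation mapping space A onto space B over the shared vocabulary.
R, _ = orthogonal_procrustes(emb_a, emb_b)
aligned = emb_a @ R

# After alignment, per-word residuals can be read as semantic shift.
drift = np.linalg.norm(aligned - emb_b, axis=1)
print(f"mean residual after alignment: {drift.mean():.4f}")
```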
arXiv Detail & Related papers (2020-04-13T15:46:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.