Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
- URL: http://arxiv.org/abs/2503.04113v1
- Date: Thu, 06 Mar 2025 05:43:35 GMT
- Title: Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
- Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt
- Abstract summary: Humans often rely on subjective natural language to direct language models (LLMs). In this work, we uncover instances of misalignment between LLMs' actual operational semantics and what humans expect.
- Score: 31.883622870253696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans often rely on subjective natural language to direct language models (LLMs); for example, users might instruct the LLM to write an enthusiastic blogpost, while developers might train models to be helpful and harmless using LLM-based edits. The LLM's operational semantics of such subjective phrases -- how it adjusts its behavior when each phrase is included in the prompt -- thus dictates how aligned it is with human intent. In this work, we uncover instances of misalignment between LLMs' actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a human-constructed reference. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more harassing outputs when it edits text to be witty, and Llama 3 8B Instruct produces dishonest articles when instructed to make the articles enthusiastic. Our results demonstrate that humans can uncover unexpected LLM behavior by scrutinizing relationships between abstract concepts, without supervising outputs directly.
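To make the TED procedure concrete, here is a minimal sketch of the thesaurus-comparison step in Python. It is illustrative only, not the authors' implementation: `llm_judges_similar` is a hypothetical stand-in for however one tests whether the LLM treats two subjective phrases as operationally similar, and the toy judgments in the usage example are invented.

```python
# Sketch of the TED idea: compare an LLM-derived thesaurus of subjective
# phrases against a human-constructed reference and report disagreements.
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple


def build_thesaurus(
    phrases: List[str],
    llm_judges_similar: Callable[[str, str], bool],
) -> Dict[FrozenSet[str], bool]:
    """Record, for every phrase pair, whether the LLM treats the two
    phrases as having similar operational semantics."""
    return {
        frozenset((a, b)): llm_judges_similar(a, b)
        for a, b in combinations(phrases, 2)
    }


def find_disagreements(
    llm_thesaurus: Dict[FrozenSet[str], bool],
    human_thesaurus: Dict[FrozenSet[str], bool],
) -> List[Tuple[str, ...]]:
    """Surface candidate misalignment: pairs the LLM links (or separates)
    where the human reference says otherwise."""
    return [
        tuple(pair)
        for pair, similar in llm_thesaurus.items()
        if pair in human_thesaurus and human_thesaurus[pair] != similar
    ]


# Toy usage (judgments are invented for illustration):
phrases = ["witty", "harassing", "enthusiastic", "dishonest"]
llm = build_thesaurus(phrases, lambda a, b: {a, b} == {"witty", "harassing"})
human = {frozenset(p): False for p in combinations(phrases, 2)}
print(find_disagreements(llm, human))  # -> [('witty', 'harassing')] (order may vary)
```

The key design choice, per the abstract, is that humans only supply judgments about relationships between abstract phrases (the reference thesaurus), never labels on individual model outputs.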
Related papers
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - Understanding the Dark Side of LLMs' Intrinsic Self-Correction [55.51468462722138]
Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts based solely on their inherent capability. Recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. We identify that intrinsic self-correction can cause LLMs to waver on both intermediate and final answers and can introduce prompt bias on simple factual questions.
arXiv Detail & Related papers (2024-12-19T15:39:31Z) - Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits [39.00434175773803]
We hired professional writers to edit paragraphs in several creative domains. We curated the LAMP corpus: 1,057 LLM-generated paragraphs edited by professional writers according to our taxonomy. Analysis of LAMP reveals that no LLM used in our study consistently outperforms the others in terms of writing quality.
arXiv Detail & Related papers (2024-09-22T16:13:00Z) - WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs)
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z) - Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs [28.58726732808416]
We employ the Greedy Coordinate Gradient to craft prompts that compel large language models to generate coherent responses from seemingly nonsensical inputs.
We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima.
Notably, we find that guiding the model to generate harmful texts is no more difficult than guiding it to generate benign texts, suggesting a lack of alignment for out-of-distribution prompts.
arXiv Detail & Related papers (2024-04-26T02:29:26Z) - Customizing Language Model Responses with Contrastive In-Context Learning [7.342346948935483]
We propose an approach that uses contrastive examples to better describe our intent.
This involves providing positive examples that illustrate the true intent, along with negative examples that show what characteristics we want LLMs to avoid.
Before generating an answer, we ask the model to analyze the examples to teach itself what to avoid.
This reasoning step provides the model with the appropriate articulation of the user's need and guides it toward generating a better answer.
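As a rough illustration of the contrastive prompt described above (a sketch under assumed formatting, not the paper's exact template), one could assemble the positive examples, negative examples, and the self-analysis instruction like this:

```python
# Hypothetical prompt builder for contrastive in-context learning:
# show GOOD examples, examples to AVOID, and ask the model to analyze
# the contrast before producing its answer.
from typing import List


def build_contrastive_prompt(
    instruction: str,
    positives: List[str],
    negatives: List[str],
) -> str:
    parts = [f"Task: {instruction}", "", "Examples of GOOD responses:"]
    parts += [f"- {ex}" for ex in positives]
    parts += ["", "Examples of responses to AVOID:"]
    parts += [f"- {ex}" for ex in negatives]
    parts += [
        "",
        "First, briefly explain what distinguishes the good responses "
        "from the ones to avoid. Then write your answer, following the "
        "good examples and avoiding the bad ones.",
    ]
    return "\n".join(parts)


prompt = build_contrastive_prompt(
    "Write an enthusiastic product blurb.",
    positives=["Upbeat, concrete copy that sticks to real features."],
    negatives=["Hype that invents features the product does not have."],
)
print(prompt)
```

The explicit "explain the contrast first" instruction is what implements the reasoning step the summary describes.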
arXiv Detail & Related papers (2024-01-30T19:13:12Z) - AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for prompting large language models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine an LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z) - Are Large Language Models Temporally Grounded? [38.481606493496514]
We provide large language models (LLMs) with textual narratives.
We probe them with respect to their common-sense knowledge of the structure and duration of events.
We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
arXiv Detail & Related papers (2023-11-14T18:57:15Z) - Tailoring Personality Traits in Large Language Models via Unsupervisedly-Built Personalized Lexicons [42.66142331217763]
Personality plays a pivotal role in shaping human expression patterns.
Previous methods relied on fine-tuning large language models (LLMs) on specific corpora.
We employ a novel Unsupervisedly-Built Personalized Lexicon (UBPL) in a pluggable manner to manipulate personality traits.
arXiv Detail & Related papers (2023-10-25T12:16:33Z) - Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards inductive instructions and enhances their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We find that different inductive styles affect the models' ability to identify the same underlying errors, and that the complexity of the underlying assumptions also influences model performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.