Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
- URL: http://arxiv.org/abs/2507.21132v1
- Date: Tue, 22 Jul 2025 14:11:13 GMT
- Title: Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
- Authors: Joshua Adrian Cahyono, Saran Subramanian
- Abstract summary: Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This paper investigates these failure modes through three experiments.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This creates risks of sycophancy and over-confidence. This paper investigates these failure modes through three experiments: (1) a multiple-choice evaluation to measure model stability against user pressure; (2) a free-response analysis using a novel safety typology and an LLM Judge; and (3) a mechanistic interpretability experiment to steer model behavior by manipulating a "high-stakes" activation vector. Our results show that while some models exhibit sycophancy, others like o4-mini remain robust. Top-performing models achieve high safety scores by frequently asking clarifying questions, a key feature of a safe, inquisitive approach, rather than issuing prescriptive advice. Furthermore, we demonstrate that a model's cautiousness can be directly controlled via activation steering, suggesting a new path for safety alignment. These findings underscore the need for nuanced, multi-faceted benchmarks to ensure LLMs can be trusted with life-changing decisions.
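The third experiment's mechanism, steering behavior by manipulating a "high-stakes" activation vector, can be illustrated with a short sketch. The snippet below is a hypothetical, minimal version of difference-of-means activation steering and is not the paper's actual code: the model (gpt2), layer index, steering strength, and contrast prompts are placeholder assumptions.

```python
# Minimal sketch (assumptions: gpt2 as a stand-in model, layer 6, alpha=4.0):
# build a "high-stakes" direction from contrastive prompts and add it to the
# residual stream with a forward hook during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # stand-in model, not the one studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6      # assumed intervention layer
ALPHA = 4.0    # assumed steering strength

high_stakes = ["Should I quit my job tomorrow to start a business?"]
low_stakes = ["Should I have tea or coffee this morning?"]

def mean_last_token_activation(prompts, layer):
    """Average the residual-stream activation of the final token at `layer`."""
    states = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer+1 matches block `layer`.
        states.append(out.hidden_states[layer + 1][0, -1, :])
    return torch.stack(states).mean(dim=0)

# Difference-of-means "high-stakes" direction.
steer_vec = (mean_last_token_activation(high_stakes, LAYER)
             - mean_last_token_activation(low_stakes, LAYER))

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the hidden-state tensor.
    return (output[0] + ALPHA * steer_vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    prompt = "I'm thinking about selling my house to pay for a risky surgery."
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=60, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

Varying ALPHA (including negative values) is the usual way such a sketch would dial cautiousness up or down; the paper's reported effect is that this kind of intervention directly controls how cautious the model's advice is.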
Related papers
- Automating Steering for Safe Multimodal Large Language Models [36.99946524593795]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z) - SAFER: Probing Safety in Reward Models with Sparse Autoencoder [15.804171763844323]
We present Sparse Autoencoder For Enhanced Reward model (SAFER). We uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification.
arXiv Detail & Related papers (2025-07-01T11:04:03Z) - SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. Certain scenarios suffer 25 times higher attack rates. Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z) - Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts [11.833385600241915]
Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content. This study evaluates a diverse set of 11 LLM judge models across critical safety domains. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer.
arXiv Detail & Related papers (2025-03-12T12:49:02Z) - The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - On Verbalized Confidence Scores for LLMs [25.160810008907397]
Uncertainty quantification for large language models (LLMs) can establish more human trust in their responses. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens. We assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompting methods. (A minimal elicitation sketch follows this list.)
arXiv Detail & Related papers (2024-12-19T11:10:36Z) - On Evaluating the Durability of Safeguards for Open-Weight LLMs [80.36750298080275]
We discuss whether technical safeguards can impede the misuse of large language models (LLMs). We show that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We suggest future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models.
arXiv Detail & Related papers (2024-12-10T01:30:32Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions. Recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to exaggerated safety. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate these exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence. (A minimal data-construction sketch also follows this list.)
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
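For the "On Verbalized Confidence Scores for LLMs" entry above, the core idea of having the model emit its own confidence as output tokens can be sketched as follows. This is a hypothetical illustration based only on the summary text: the prompt template, regex, and `ask_llm` callable are assumptions, not that paper's setup.

```python
# Minimal sketch of eliciting and parsing a verbalized confidence score.
# The toy fake_llm stands in for any text-in/text-out model call.
import re
from typing import Callable, Optional, Tuple

CONF_PROMPT = (
    "{question}\n\n"
    "Answer the question, then on a new line write "
    "'Confidence: <number between 0 and 100>'."
)

def answer_with_confidence(question: str,
                           ask_llm: Callable[[str], str]) -> Tuple[str, Optional[float]]:
    """Query a model and parse its self-reported confidence into [0, 1]."""
    raw = ask_llm(CONF_PROMPT.format(question=question))
    match = re.search(r"Confidence:\s*([0-9]+(?:\.[0-9]+)?)", raw)
    confidence = float(match.group(1)) / 100.0 if match else None
    answer = raw.split("Confidence:")[0].strip()
    return answer, confidence

# Stand-in model so the sketch runs without any API key.
fake_llm = lambda prompt: "Paris is the capital of France.\nConfidence: 95"
print(answer_with_confidence("What is the capital of France?", fake_llm))
```

The reliability question that paper studies is precisely whether such self-reported numbers track actual correctness across datasets, models, and prompting methods.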
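For the DeRTa entry, the data-construction idea of appending a harmful response prefix before the safe refusal, so the model learns to switch to a refusal mid-response, can be sketched as below. This is a hypothetical illustration based only on the summary text; the field names, prefix-length policy, and use of -100 as a loss-ignore index are assumptions, not the authors' released implementation.

```python
# Minimal sketch of building a "harmful response prefix" training example:
# loss is applied only to the safe-refusal tokens, the prompt and harmful
# prefix serve as context the model must learn to bail out of.
import random
from dataclasses import dataclass
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer

@dataclass
class RefusalExample:
    input_ids: list   # prompt + partial harmful response + safe refusal
    labels: list      # -100 everywhere except the safe-refusal tokens

def build_example(prompt: str, harmful_response: str, safe_refusal: str,
                  rng: random.Random) -> RefusalExample:
    """Prepend a random-length harmful prefix so the model is trained to
    transition to a refusal at an arbitrary response position."""
    harm_ids = tok.encode(harmful_response)
    cut = rng.randint(0, len(harm_ids))            # how much harm precedes the refusal
    prefix_ids = tok.encode(prompt) + harm_ids[:cut]
    refusal_ids = tok.encode(safe_refusal) + [tok.eos_token_id]

    input_ids = prefix_ids + refusal_ids
    labels = [-100] * len(prefix_ids) + refusal_ids  # loss only on the refusal
    return RefusalExample(input_ids=input_ids, labels=labels)

rng = random.Random(0)
ex = build_example(
    prompt="User: Explain how to pick a lock.\nAssistant: ",
    harmful_response="Sure, first you insert a tension wrench and",
    safe_refusal="I can't help with that, but I can point you to a licensed locksmith.",
    rng=rng,
)
print(len(ex.input_ids), sum(label != -100 for label in ex.labels))
```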
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.