Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
- URL: http://arxiv.org/abs/2506.20666v2
- Date: Mon, 07 Jul 2025 00:34:34 GMT
- Title: Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
- Authors: Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
- Abstract summary: We use a cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning.
- Score: 13.120615048847434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth and maintaining trust, all while being mindful of another person's feelings. These value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called "cognitive models" provide formal accounts of these trade-offs in humans by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning "effort" in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs' training dynamics suggest large shifts in utility values early in training, with persistent effects of the choice of base model and pretraining data compared to the feedback dataset or alignment method. We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.
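To make the abstract's notion of "weighting a speaker's competing utility functions" concrete, below is a minimal Python sketch of an RSA-style polite-speaker model in the spirit of the cognitive modeling work the paper builds on: the speaker softmax-samples an utterance under a weighted mix of informational utility (being truthful to a literal listener) and social utility (implying a kind state of the world). The utterance set, valence mapping, and the parameters `phi` and `alpha` are illustrative assumptions, not the paper's fitted model.

```python
import math

# Illustrative utility-weighted polite-speaker model (values are assumptions,
# not the paper's fitted parameters).

# Hypothetical true states (e.g., quality of a friend's work on a 1-5 scale)
# and candidate utterances with their literal meanings.
STATES = [1, 2, 3, 4, 5]
UTTERANCES = {
    "terrible": [1],
    "bad":      [1, 2],
    "okay":     [2, 3, 4],
    "good":     [4, 5],
    "amazing":  [5],
}

def literal_listener(utterance):
    """Uniform belief over states consistent with the utterance's literal meaning."""
    consistent = UTTERANCES[utterance]
    return {s: (1.0 / len(consistent) if s in consistent else 0.0) for s in STATES}

def informational_utility(utterance, true_state):
    """Log-probability the literal listener assigns to the true state."""
    p = literal_listener(utterance)[true_state]
    return math.log(p) if p > 0 else -1e9  # heavily penalize literally false utterances

def social_utility(utterance):
    """Expected subjective value (valence) of the state the listener infers."""
    valence = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}  # illustrative mapping
    return sum(p * valence[s] for s, p in literal_listener(utterance).items())

def speaker_distribution(true_state, phi=0.6, alpha=5.0):
    """Softmax over utterances; phi trades off informational vs. social utility."""
    scores = {
        u: alpha * (phi * informational_utility(u, true_state)
                    + (1 - phi) * social_utility(u))
        for u in UTTERANCES
    }
    z = sum(math.exp(s) for s in scores.values())
    return {u: math.exp(s) / z for u, s in scores.items()}

if __name__ == "__main__":
    # A speaker describing genuinely poor work (state 2): higher phi favors
    # blunt informativeness, lower phi favors kinder, vaguer utterances.
    for phi in (0.9, 0.5, 0.1):
        dist = speaker_distribution(true_state=2, phi=phi)
        best = max(dist, key=dist.get)
        print(f"phi={phi}: most likely utterance = {best!r}")
```

Sweeping `phi` reproduces the qualitative trade-off the paper probes in LLMs: a high-`phi` speaker prefers blunt, informative utterances, while a low-`phi` speaker drifts toward kinder but less informative ones.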
Related papers
- Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z) - The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models [6.187227278086245]
Large language models (LLMs) have demonstrated emerging capabilities in social intelligence, including implicature resolution and theory-of-mind reasoning. In this work, we evaluate whether LLMs at different training stages can accurately infer speaker intentions. We systematically evaluate 22 LLMs across 3 key training stages: after pre-training, supervised fine-tuning (SFT), and preference optimization.
arXiv Detail & Related papers (2025-05-24T04:24:59Z) - Efficient or Powerful? Trade-offs Between Machine Learning and Deep Learning for Mental Illness Detection on Social Media [0.036136619420474754]
Social media platforms provide valuable insights into mental health trends by capturing user-generated discussions on conditions such as depression, anxiety, and suicidal ideation. Machine learning (ML) and deep learning (DL) models have been increasingly applied to classify mental health conditions from textual data. This study evaluates multiple ML models, including logistic regression, random forest, and LightGBM, alongside deep learning architectures such as ALBERT and Gated Recurrent Units (GRUs). Our findings indicate that ML and DL models achieve comparable classification performance on medium-sized datasets.
arXiv Detail & Related papers (2025-03-03T00:51:41Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - A Survey on Human Preference Learning for Large Language Models [81.41868485811625]
The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning.
This survey covers the sources and formats of preference feedback, the modeling and usage of preference signals, as well as the evaluation of the aligned LLMs.
arXiv Detail & Related papers (2024-06-17T03:52:51Z) - Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice [4.029252551781513]
We propose a novel way to enhance the utility of Large Language Models as cognitive models. We show that an LLM pretrained on an ecologically valid arithmetic dataset predicts human behavior better than many traditional cognitive models.
arXiv Detail & Related papers (2024-05-29T17:37:14Z) - Explaining Large Language Models Decisions Using Shapley Values [1.223779595809275]
Large language models (LLMs) have opened up exciting possibilities for simulating human behavior and cognitive processes.
However, the validity of utilizing LLMs as stand-ins for human subjects remains uncertain.
This paper presents a novel approach based on Shapley values to interpret LLM behavior and quantify the relative contribution of each prompt component to the model's output.
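As a rough illustration of this kind of attribution, the sketch below computes exact Shapley values over a small set of prompt components; `toy_score` is a hypothetical stand-in for whatever model-output metric is attributed, and the paper's actual estimator, components, and metric may differ.

```python
from itertools import combinations
from math import factorial

def shapley_values(components, score_fn):
    """Exact Shapley value of each component under score_fn (exponential in len(components))."""
    n = len(components)
    values = {c: 0.0 for c in components}
    for c in components:
        others = [x for x in components if x != c]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = score_fn(set(subset) | {c}) - score_fn(set(subset))
                values[c] += weight * marginal
    return values

# Toy stand-in for a model-output metric: the "model" rewards having both an
# instruction and an example in the prompt, with a small bonus for a persona.
def toy_score(included):
    score = 0.0
    if "instruction" in included and "example" in included:
        score += 1.0
    if "persona" in included:
        score += 0.2
    return score

print(shapley_values(["instruction", "example", "persona"], toy_score))
# Shapley splits the instruction-example interaction evenly (0.5 each) and
# assigns the persona its standalone bonus (0.2).
```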
arXiv Detail & Related papers (2024-03-29T22:49:43Z) - CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments.
We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques.
We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z) - Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs). We suggest investigating internal activations and quantifying LLMs' truthfulness using the local intrinsic dimension (LID) of model activations.
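For orientation, the sketch below shows one standard way to estimate local intrinsic dimension, the maximum-likelihood estimator from k-nearest-neighbor distances (Levina & Bickel, 2005), applied to a matrix of activations; the choice of estimator, layer, and hyperparameters here is an assumption and not necessarily the paper's setup.

```python
import numpy as np

def lid_mle(activations: np.ndarray, k: int = 20) -> np.ndarray:
    """Per-point LID estimates for rows of `activations` from k nearest neighbors (MLE)."""
    # Pairwise Euclidean distances via the squared-norm expansion.
    sq = np.sum(activations ** 2, axis=1)
    dists = np.sqrt(np.maximum(sq[:, None] + sq[None, :]
                               - 2.0 * activations @ activations.T, 0.0))
    np.fill_diagonal(dists, np.inf)        # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]    # k smallest distances per point, ascending
    r_k = knn[:, -1:]                      # distance to the k-th neighbor
    # MLE: LID = -1 / mean(log(r_j / r_k)) over the first k-1 neighbors.
    log_ratios = np.log(knn[:, :-1] / r_k)
    return -1.0 / log_ratios.mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "activations": 500 points on a 5-dimensional subspace of R^64.
    low_dim = rng.normal(size=(500, 5))
    acts = low_dim @ rng.normal(size=(5, 64))
    print("median LID estimate:", np.median(lid_mle(acts, k=20)))  # close to 5
```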
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)