Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring
- URL: http://arxiv.org/abs/2502.05242v2
- Date: Wed, 28 May 2025 14:27:44 GMT
- Title: Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring
- Authors: Guanxu Chen, Dongrui Liu, Tao Luo, Lijie Hu, Jing Shao
- Abstract summary: Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making process remain unclear. We propose TELLME, a novel method that improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors.
- Score: 18.837335987273256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making process remain unclear. Chain-of-thought (CoT) outputs are commonly used to monitor LLMs, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective for monitoring their latent thinking. However, previous methods only try to develop external monitors instead of making LLMs themselves easier to monitor. In this paper, we propose TELLME, a novel method that improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase applications of TELLME on trustworthiness tasks (e.g., safety-risk monitoring and detoxification), where LLMs achieve consistent improvements in transparency and task performance. More crucially, we theoretically analyze how TELLME improves LLMs' generalization ability through optimal transport theory.
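The abstract gives no implementation details, so the following is only a rough sketch of the general recipe it gestures at: add a training signal that separates the hidden representations of different behavior classes, then check how well a simple external monitor (a linear probe) can read them. The projection head, the separation loss, and the synthetic hidden states are illustrative assumptions, not TELLME's actual formulation.

```python
# Illustrative sketch only (not the TELLME method itself): a generic separation
# objective on hidden states plus a linear-probe monitor, to convey the idea of
# making different behavior classes easier to tell apart from the inside.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

torch.manual_seed(0)

# Stand-ins for hidden states taken from some LLM layer (in practice obtained via
# model(..., output_hidden_states=True)); label 1 marks sensitive/unsafe behavior.
labels = torch.randint(0, 2, (256,))
hidden = torch.randn(256, 64) + 0.5 * labels[:, None]

proj = torch.nn.Linear(64, 64)                  # toy "transparency" head
opt = torch.optim.Adam(proj.parameters(), lr=1e-2)

for _ in range(200):
    z = proj(hidden)
    mu0, mu1 = z[labels == 0].mean(0), z[labels == 1].mean(0)
    within = (z[labels == 0] - mu0).pow(2).mean() + (z[labels == 1] - mu1).pow(2).mean()
    between = (mu0 - mu1).pow(2).sum()
    loss = within - 0.1 * between               # pull classes tight, push their means apart
    opt.zero_grad()
    loss.backward()
    opt.step()

# The "monitor": a simple linear probe; its held-out accuracy is a rough proxy for
# how easy the (projected) representations are to monitor.
z = proj(hidden).detach().numpy()
X_tr, X_te, y_tr, y_te = train_test_split(z, labels.numpy(), random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy on held-out states:", probe.score(X_te, y_te))
```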
Related papers
- LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests [7.6818904666624395]
This paper proposes a dual-LLM system and experiments with using LLMs to generate compiler tests. The results suggest that LLMs have promising potential to generate quality compiler tests and verify them automatically.
arXiv Detail & Related papers (2025-07-29T02:34:28Z)
- Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs [89.76543013729594]
Vision Transformers (ViTs) can be integrated with Large Language Model (LLM) blocks for vision-only tasks. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable finetuning. LUViT bridges this modality mismatch through a synergistic pre-training strategy.
arXiv Detail & Related papers (2025-07-01T13:58:21Z)
- EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM [8.3321872381107]
We introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates an LLM and a VLM. Unlike existing methods, EMAC+ dynamically refines high-level textual plans using real-time feedback from a VLM executing low-level visual control tasks. EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning.
arXiv Detail & Related papers (2025-05-26T12:34:16Z)
- Lightweight Latent Verifiers for Efficient Meta-Generation Strategies [0.5892638927736115]
Verifiers are auxiliary models that assess the correctness of outputs generated by base large language models (LLMs). In this work, we introduce a novel lightweight verification approach, LiLaVe, which reliably extracts correctness signals from the hidden states of the base LLM. A key advantage of LiLaVe is its ability to operate with only a small fraction of the computational budget required by traditional LLM-based verifiers.
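As a minimal sketch of the lightweight-verifier idea (assumptions only; LiLaVe's actual features and classifier may differ), one can fit a small classifier on pooled hidden states of candidate answers and use it for best-of-n selection:

```python
# Minimal sketch, not LiLaVe's actual architecture: a tiny classifier over pooled
# hidden states of candidate answers, reused for best-of-n answer selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for pooled hidden states of (question, candidate answer) pairs,
# with binary correctness labels collected offline.
X_train = rng.normal(size=(500, 128))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
verifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Meta-generation: keep the sampled candidate whose hidden state scores highest.
candidates = rng.normal(size=(8, 128))        # 8 sampled answers for one question
scores = verifier.predict_proba(candidates)[:, 1]
best = int(np.argmax(scores))
print(f"selected candidate {best} with verifier score {scores[best]:.2f}")
```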
arXiv Detail & Related papers (2025-04-23T14:33:20Z)
- Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [50.16340812031201]
We show that large language models (LLMs) do not update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model.
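For intuition, the "optimal Bayesian model" an LLM could be taught to mimic can be as simple as beta-binomial updating; the coin-flip task and uniform prior below are illustrative assumptions, not the paper's setup:

```python
# Toy example of a Bayesian teaching target: beta-binomial updating of a coin's
# bias after observing heads/tails evidence. An LLM prompted with the same
# evidence would be fine-tuned to output (approximately) these probabilities.

def beta_binomial_posterior_mean(heads: int, tails: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mean of the bias under a Beta(a, b) prior after the observed flips."""
    return (a + heads) / (a + b + heads + tails)

for heads, tails in [(0, 0), (3, 1), (7, 1), (2, 8)]:
    p = beta_binomial_posterior_mean(heads, tails)
    print(f"evidence: {heads} heads, {tails} tails -> P(next = heads) = {p:.3f}")
```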
arXiv Detail & Related papers (2025-03-21T20:13:04Z)
- Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders [29.356200147371275]
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses.
We propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective.
We propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
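A minimal sketch of this kind of runtime steering, under assumptions (a sparse autoencoder over residual-stream states, synthetic weights, an arbitrary feature index), not the paper's exact strategies:

```python
# Minimal sketch of steering with a sparse autoencoder (SAE) feature: compute the
# feature activations of a hidden state, rescale one interpreted feature, and add
# the resulting change back. All weights and the chosen feature are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def steer(h: np.ndarray, feature_idx: int, scale: float) -> np.ndarray:
    """Scale one SAE feature's activation and add the resulting change back to h."""
    acts = np.maximum(h @ W_enc + b_enc, 0.0)        # ReLU feature activations
    delta = (scale - 1.0) * acts[feature_idx] * W_dec[feature_idx]
    return h + delta                                  # adjusted residual-stream state

h = rng.normal(size=d_model)                          # stand-in for a hidden state
h_steered = steer(h, feature_idx=42, scale=3.0)       # amplify feature 42
print(np.linalg.norm(h_steered - h))
```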
arXiv Detail & Related papers (2025-02-21T16:36:42Z)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
- Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search [2.1637240640145343]
Large language models (LLMs) have demonstrated their remarkable capacity across a variety of tasks. To improve LLMs' reasoning ability, process supervision has proven to be better than outcome supervision. In this work, we study using Monte Carlo Tree Search (MCTS) to generate process supervision data with LLMs themselves for training them.
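A simplified sketch of generating step-level supervision without human labels (plain Monte Carlo rollouts rather than the paper's full MCTS; the LLM sampler and answer checker are hypothetical stubs):

```python
# Simplified sketch (not the paper's full MCTS): estimate a value for each reasoning
# step by sampling completions from that step onward and checking final answers.
# `sample_completion` and `final_answer_is_correct` stand in for LLM and checker calls.
import random

random.seed(0)

def sample_completion(prefix_steps: list[str]) -> list[str]:
    """Stand-in for sampling the rest of a solution from the LLM given a prefix."""
    return prefix_steps + ["<sampled continuation>"]

def final_answer_is_correct(solution_steps: list[str]) -> bool:
    """Stand-in for comparing the completed solution's answer to the gold answer."""
    return random.random() < 0.5

def step_values(steps: list[str], rollouts: int = 8) -> list[float]:
    """For each prefix of the solution, the fraction of rollouts that end correctly."""
    values = []
    for k in range(1, len(steps) + 1):
        wins = sum(
            final_answer_is_correct(sample_completion(steps[:k]))
            for _ in range(rollouts)
        )
        values.append(wins / rollouts)
    return values

solution = ["step 1: set up the equation", "step 2: simplify", "step 3: solve"]
print(step_values(solution))   # step-level labels for training a process reward model
```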
arXiv Detail & Related papers (2025-01-02T12:09:17Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
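One rough way to picture deliberation-guided decoding is best-first search over partial reasoning traces with a value estimate; the proposal and value functions below are hypothetical stubs, not Q*'s learned components:

```python
# Rough sketch of value-guided decoding as best-first search over partial traces.
import heapq
import itertools

def propose_next_steps(state: tuple[str, ...]) -> list[str]:
    """Stand-in for sampling a few candidate next reasoning steps from the LLM."""
    return [f"step {len(state) + 1}a", f"step {len(state) + 1}b"]

def estimate_value(state: tuple[str, ...]) -> float:
    """Stand-in for a learned value of a partial trace (higher = more promising)."""
    return -len(state) + (0.5 if state and state[-1].endswith("a") else 0.0)

def best_first_decode(max_steps: int = 3) -> tuple[str, ...]:
    counter = itertools.count()                    # tie-breaker for the heap
    frontier = [(-estimate_value(()), next(counter), ())]
    while frontier:
        _, _, state = heapq.heappop(frontier)      # expand the most promising trace
        if len(state) == max_steps:                # first complete trace popped wins
            return state
        for step in propose_next_steps(state):
            child = state + (step,)
            heapq.heappush(frontier, (-estimate_value(child), next(counter), child))
    return ()

print(best_first_decode())
```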
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protection approach that exploits the inherent safety awareness of MLLMs.
ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z)
- Towards Uncovering How Large Language Model Works: An Explainability Perspective [38.07611356855978]
Large language models (LLMs) have led to breakthroughs in language tasks, yet the internal mechanisms that enable their remarkable generalization and reasoning abilities remain opaque.
This paper aims to uncover the mechanisms underlying LLM functionality through the lens of explainability.
arXiv Detail & Related papers (2024-02-16T13:46:06Z)
- AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation [60.40409210088717]
Abstraction ability is crucial to human intelligence and can also benefit various tasks in NLP study.
Existing work shows that LLMs are deficient in abstraction ability, and how to improve it remains unexplored.
We design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning.
arXiv Detail & Related papers (2024-02-16T12:47:11Z)
- FaithLM: Towards Faithful Explanations for Large Language Models [67.29893340289779]
Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their internal knowledge and reasoning capabilities.
The black-box nature of these models complicates the task of explaining their decision-making processes.
We introduce FaithLM to explain the decisions of LLMs with natural language (NL) explanations.
arXiv Detail & Related papers (2024-02-07T09:09:14Z)
- Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models [26.11408084129897]
Large Language Models (LLMs) are deployed as powerful tools for several natural language processing (NLP) applications.
Recent works show that modern LLMs can generate self-explanations (SEs), which elicit their intermediate reasoning steps for explaining their behavior.
We discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs.
arXiv Detail & Related papers (2024-02-07T06:32:50Z)
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
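As a minimal illustration of uncertainty quantification for a multiple-choice item, option scores can be turned into a distribution whose entropy is reported alongside accuracy; the scores below are made up:

```python
# Predictive entropy over multiple-choice options as an uncertainty measure.
# In practice the scores would be the LLM's (log-)probabilities for "A"-"D".
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

option_scores = np.array([2.1, 0.3, -0.5, 0.0])      # model scores for options A-D
probs = softmax(option_scores)
entropy = -(probs * np.log(probs)).sum()             # 0 = fully certain, ln(4) = maximal

prediction = "ABCD"[int(np.argmax(probs))]
print(f"prediction: {prediction}, probs: {np.round(probs, 3)}, entropy: {entropy:.3f} nats")
```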
arXiv Detail & Related papers (2024-01-23T14:29:17Z)
- Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
- Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models [20.28989820878285]
Large language models (LLMs) have achieved remarkable advancements in natural language processing.
The massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments.
arXiv Detail & Related papers (2023-11-15T18:56:23Z)
- Explanation-aware Soft Ensemble Empowers Large Language Model In-context Learning [50.00090601424348]
Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks.
We propose EASE, an Explanation-Aware Soft Ensemble framework to empower in-context learning with LLMs.
arXiv Detail & Related papers (2023-11-13T06:13:38Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)