Related papers: Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

URL: http://arxiv.org/abs/2402.19465v2
Date: Sat, 31 Aug 2024 11:31:02 GMT
Title: Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Authors: Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao
Abstract summary: We pioneer the exploration of LLM's trustworthiness during pre-training. We focus on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. We are the first to observe a similar two-phase phenomenon: fitting and compression.
Score: 47.439995799065755
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from an LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by the result of [choi2023understanding] that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression [shwartz2017opening]. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at https://github.com/ChnQ/TracingLLM.
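Illustrative sketch: the abstract mentions linear probing and steering vectors without giving details, so the following is only a minimal example of what those two ideas look like in general, not the authors' implementation. Everything in it is an assumption: the activations are synthetic Gaussian data rather than real LLM hidden states, the probe is a plain logistic regression trained with gradient descent, and the steering vector is taken to be the difference of class means, a common construction that the abstract does not confirm. All function names and the steering strength are hypothetical.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=32, seed=0):
    """Stand-in for hidden states: two classes separated along one direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    # Class 1 is shifted by +2 along `direction`, class 0 by -2.
    acts = rng.normal(size=(n, dim)) + 2.0 * np.outer(2 * labels - 1, direction)
    return acts, labels


def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, dim = acts.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples the linear probe classifies correctly."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())


def steering_vector(acts, labels):
    """Assumed construction: mean activation of class 1 minus class 0."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)


acts, labels = make_synthetic_activations()
train, test = slice(0, 300), slice(300, None)

w, b = train_linear_probe(acts[train], labels[train])
acc = probe_accuracy(w, b, acts[test], labels[test])

v = steering_vector(acts[train], labels[train])
strength = 1.0  # hypothetical steering strength
steered = acts[test] + strength * v  # shift held-out activations toward class 1

print(f"held-out probe accuracy: {acc:.3f}")
print(f"mean probe logit before steering: {(acts[test] @ w + b).mean():.3f}")
print(f"mean probe logit after steering:  {(steered @ w + b).mean():.3f}")
```

In a real setting, `acts` would be hidden states extracted from a given pre-training checkpoint and `labels` would mark, for example, toxic versus non-toxic inputs; repeating the probe across checkpoints is what would trace how separability changes during pre-training.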

Related papers

Risk Assessment Framework for Code LLMs via Leveraging Internal States [4.216536684967512]
We propose PtTrust, a two-stage risk assessment framework for code LLMs based on internal state pre-training. PtTrust first performs unsupervised pre-training on large-scale unlabeled source code to learn general representations of LLM states. We demonstrate the effectiveness of PtTrust through fine-grained, code line-level risk assessment.
arXiv Detail & Related papers (2025-04-20T14:44:18Z)
Gauging Overprecision in LLMs: An Empirical Study [5.359801516815977]
This study is inspired by a different aspect of overconfidence in cognitive science called overprecision. In the generation phase, we prompt the LLM to generate answers to numerical questions in the form of intervals with a certain level of confidence. In the refinement phase, answers from the previous phase are refined to generate better answers.
arXiv Detail & Related papers (2025-04-16T14:02:21Z)
From Text to Time? Rethinking the Effectiveness of the Large Language Model for Time Series Forecasting [22.052783052469344]
Using pre-trained large language models (LLMs) as the backbone for time series prediction has recently gained significant research interest. We observe that training and testing LLM-based models on small datasets often leads to the Encoder and Decoder becoming overly adapted to the dataset. Extensive experiments reveal that although the LLM backbone demonstrates some promise, its forecasting performance is limited.
arXiv Detail & Related papers (2025-04-09T13:20:09Z)
LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression. LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model. Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z)
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations. We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization. We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z)
Exploring Forgetting in Large Language Model Pre-Training [18.858330348834777]
Catastrophic forgetting remains a formidable obstacle to building an omniscient model in large language models (LLMs). We systematically explored the existence and measurement of forgetting in pre-training, questioning traditional metrics such as perplexity (PPL) and introducing new metrics to better detect entity memory retention.
arXiv Detail & Related papers (2024-10-22T13:39:47Z)
SPOT: Text Source Prediction from Originality Score Thresholding [6.790905400046194]
Countermeasures aimed at detecting misinformation usually involve domain-specific models trained to recognize the relevance of any information. Instead of evaluating the validity of the information, we propose to investigate LLM-generated text from the perspective of trust.
arXiv Detail & Related papers (2024-05-30T21:51:01Z)
Traj-LLM: A New Exploration for Empowering Trajectory Prediction with Pre-trained Large Language Models [12.687494201105066]
This paper proposes Traj-LLM, the first to investigate the potential of using Large Language Models (LLMs) to generate future motion from agents' past/observed trajectories and scene semantics. LLMs' powerful comprehension abilities capture a spectrum of high-level scene knowledge and interactive information. Emulating the human-like lane focus cognitive function, we introduce lane-aware probabilistic learning powered by the pioneering Mamba module.
arXiv Detail & Related papers (2024-05-08T09:28:04Z)
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs). We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
TrustLLM: Trustworthiness in Large Language Models [446.5640421311468]
This paper introduces TrustLLM, a comprehensive study of trustworthiness in large language models (LLMs). We first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics.
arXiv Detail & Related papers (2024-01-10T22:07:21Z)
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN). At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.