JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs
- URL: http://arxiv.org/abs/2510.17918v1
- Date: Mon, 20 Oct 2025 02:12:49 GMT
- Title: JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs
- Authors: Junlan Feng, Fanyu Meng, Chong Long, Pengyu Cong, Duqing Wang, Yan Zheng, Yuyao Zhang, Xuanchang Gao, Ye Yuan, Yunfei Ma, Zhijie Ren, Fan Yang, Na Wu, Di Jin, Chao Deng,
- Abstract summary: It is widely agreed that the unsafe behavior and hallucinations of large language models intrinsically originate from pre-training. Since the data is vast, it is almost impossible to entirely purge it of factual errors, logical inconsistencies, or distributional biases. We propose approaches to enhancing our pre-training data with its context in the world and adding a substantial amount of data reflecting industrial scenarios.
- Score: 53.59414720003988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The hallucination and credibility concerns of large language models (LLMs) are global challenges that the industry is collectively addressing. Recently, significant advances have been made in post-training and inference techniques to mitigate these challenges. However, it is widely agreed that the unsafe behavior and hallucinations of LLMs intrinsically originate from pre-training, involving both the pre-training data and the next-token prediction learning mechanism. In this paper, we focus on enhancing pre-training data to improve the trustworthiness and safety of LLMs. Since the data is vast, it is almost impossible to entirely purge it of factual errors, logical inconsistencies, or distributional biases. Moreover, pre-training data lack grounding in real-world knowledge: each piece of data is treated as a sequence of tokens rather than as a representation of a part of the world. To overcome these issues, we propose approaches to enhancing our pre-training data with its context in the world and adding a substantial amount of data reflecting industrial scenarios. We argue that most source data are created by their authors for specific purposes in a certain spatio-temporal context; they have played a role in the real world. By incorporating related world-context information, we aim to better anchor pre-training data within real-world scenarios, thereby reducing uncertainty in model training and enhancing the model's safety and trustworthiness. We refer to our Data with World Context as DWC. We continue pre-training an earlier checkpoint of JT-35B-Base with 1.5 trillion DWC tokens, and we introduce post-training procedures to activate the potential of DWC. Compared with the Qwen model of a similar scale, JT-Safe-35B achieves an average performance improvement of 1.79% on safety and trustworthiness evaluation benchmarks, while being pre-trained with only 6.2 trillion tokens.
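The abstract does not specify how world context is attached to a training sample, but the idea can be sketched as a preprocessing step. The sketch below is a hypothetical illustration, not the paper's actual pipeline: the `WorldContext` fields (author, date, location, purpose) and the `[CONTEXT]` header format are assumptions chosen to show how a document could be anchored to its real-world origin before tokenization.

```python
from dataclasses import dataclass


@dataclass
class WorldContext:
    """Hypothetical spatio-temporal context for a source document."""
    author: str    # who created the document
    created: str   # ISO date it was written
    location: str  # where it originated
    purpose: str   # why the author wrote it


def attach_world_context(text: str, ctx: WorldContext) -> str:
    """Prefix a pre-training sample with its world context, so the model
    sees the document as an artifact of a real situation rather than a
    bare token sequence."""
    header = (
        f"[CONTEXT] author={ctx.author}; date={ctx.created}; "
        f"place={ctx.location}; purpose={ctx.purpose} [/CONTEXT]\n"
    )
    return header + text


# Example: an industrial-scenario document annotated with its origin.
sample = attach_world_context(
    "Quarterly maintenance report for turbine unit 7 ...",
    WorldContext("plant engineer", "2023-04-02", "Tianjin", "maintenance log"),
)
print(sample.splitlines()[0])
```

In this sketch the context is serialized inline ahead of the text; whether JT-Safe injects context as a textual prefix, as structured metadata, or through some other mechanism is not stated in the abstract.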
Related papers
- AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models [75.214287449744]
We introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We demonstrate through extensive experiments that our model significantly outperforms baselines in predicting failures.
arXiv Detail & Related papers (2025-11-25T13:57:24Z) - Retracing the Past: LLMs Emit Training Data When They Get Lost [18.852558767604823]
Memorization of training data in large language models poses significant privacy and copyright concerns. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data.
arXiv Detail & Related papers (2025-10-27T03:48:24Z) - Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data [68.85234898614571]
The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly incorporated during the mid-training stage as well, its role in pretraining remains unclear. We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training.
arXiv Detail & Related papers (2025-09-26T20:08:51Z) - Thinking Augmented Pre-training [88.04395622064708]
This paper introduces Thinking Augmented Pre-Training, a simple, scalable, and universal methodology that improves the data efficiency of large language model (LLM) training by augmenting existing text data with automatically generated thinking trajectories.
arXiv Detail & Related papers (2025-09-24T14:45:13Z) - A Survey on Data Security in Large Language Models [12.23432845300652]
Large Language Models (LLMs) are foundational to advancing natural language processing, powering applications such as text generation, machine translation, and conversational systems. Despite their transformative potential, these models inherently rely on massive amounts of training data, often collected from diverse and uncurated sources, which exposes them to serious data security risks. Harmful or malicious data can compromise model behavior, leading to issues such as toxic output, hallucinations, and vulnerabilities to threats such as prompt injection or data poisoning. This survey offers a comprehensive overview of the main data security risks facing LLMs and reviews current defense strategies, including adversarial …
arXiv Detail & Related papers (2025-08-04T11:28:34Z) - Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble [2.1450827490014865]
We present a light-weight, post-processing method for mitigating counterfactual unfairness in closed-source text safety classifiers.
We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases.
Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
arXiv Detail & Related papers (2024-09-05T14:35:35Z) - Fed-Credit: Robust Federated Learning with Credibility Management [18.349127735378048]
Federated Learning (FL) is an emerging machine learning approach enabling model training on decentralized devices or data sources.
We propose a robust FL approach based on the credibility management scheme, called Fed-Credit.
The results exhibit superior accuracy and resilience against adversarial attacks, all while maintaining comparatively low computational complexity.
arXiv Detail & Related papers (2024-05-20T03:35:13Z) - Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate this issue.
We propose a training algorithm LM-TOAST to tackle the challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z) - Do Gradient Inversion Attacks Make Federated Learning Unsafe? [70.0231254112197]
Federated learning (FL) allows the collaborative training of AI models without needing to share raw data.
Recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data.
In this work, we show that the attacks presented in the literature are impractical in real FL use-cases, and we provide a new baseline attack.
arXiv Detail & Related papers (2022-02-14T18:33:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.