Too Helpful, Too Harmless, Too Honest or Just Right?
- URL: http://arxiv.org/abs/2509.08486v2
- Date: Mon, 15 Sep 2025 03:28:04 GMT
- Title: Too Helpful, Too Harmless, Too Honest or Just Right?
- Authors: Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
- Abstract summary: Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks. Aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture.
- Score: 19.134202394422285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks, Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX's generalization across diverse LLM backbones.
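TrinityX's own implementation is not included on this page; the sketch below is a minimal, illustrative PyTorch rendering of the idea described in the abstract: three separately trained HHH experts mixed by temperature-calibrated, input-dependent routing weights. All class and parameter names are hypothetical.

```python
# Minimal PyTorch sketch of calibrated routing over three HHH experts.
# Illustrative only: class and parameter names are hypothetical, not TrinityX's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalibratedHHHRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)              # task-adaptive routing logits
        self.log_temp = nn.Parameter(torch.zeros(n_experts))   # per-expert calibration temperatures

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> calibrated routing weights (batch, seq, n_experts)
        logits = self.gate(h) / self.log_temp.exp()            # temperature-scale each expert's logit
        return F.softmax(logits, dim=-1)

class TrinityXBlockSketch(nn.Module):
    """Combines helpfulness/harmlessness/honesty experts into one representation."""
    def __init__(self, d_model: int):
        super().__init__()
        # Stand-ins for separately trained HHH experts (e.g., adapter MLPs).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
             for _ in range(3)]
        )
        self.router = CalibratedHHHRouter(d_model, n_experts=3)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        weights = self.router(h)                                         # (B, T, 3)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)   # (B, T, d, 3)
        # Weighted sum of expert outputs -> unified, alignment-aware representation.
        return torch.einsum("btde,bte->btd", expert_out, weights)

# Usage: fused = TrinityXBlockSketch(d_model=512)(torch.randn(2, 16, 512))
```

Per-expert temperatures are the simplest form of routing calibration; the paper's calibrated, task-adaptive routing may differ substantially.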
Related papers
- A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs [64.8510381475827]
Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently. SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set.
arXiv Detail & Related papers (2026-02-23T15:11:16Z)
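Finding (iii) above, that expert workload and importance can be estimated from a small calibration set, can be illustrated with a rough sketch (our assumptions, not the paper's code): route a calibration batch through the gate and accumulate per-expert selection counts and gate mass.

```python
# Rough sketch (not the paper's code): estimate per-expert load and importance
# from a small calibration set by accumulating routing statistics.
import torch
import torch.nn.functional as F

def calibration_stats(gate: torch.nn.Linear, calib_tokens: torch.Tensor, top_k: int = 2):
    """calib_tokens: (n_tokens, d_model). Returns (load_counts, gate_mass) per expert."""
    n_experts = gate.out_features
    load = torch.zeros(n_experts)
    mass = torch.zeros(n_experts)
    with torch.no_grad():
        probs = F.softmax(gate(calib_tokens), dim=-1)           # (n_tokens, n_experts)
        topv, topi = probs.topk(top_k, dim=-1)                   # top-k routing per token
        load.scatter_add_(0, topi.reshape(-1), torch.ones(topi.numel()))
        mass.scatter_add_(0, topi.reshape(-1), topv.reshape(-1))
    # load = selection frequency; mass = accumulated gate weight (an "importance" proxy)
    return load, mass

# Example: gate = torch.nn.Linear(512, 8); load, mass = calibration_stats(gate, torch.randn(1000, 512))
```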
- When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified [19.134202394422285]
Large Language Models (LLMs) need to be in accordance with human values, being helpful, harmless, and honest (HHH). Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. We propose AlignX, a two-stage framework that mitigates catastrophic forgetting and improves inference reliability.
arXiv Detail & Related papers (2026-02-07T05:52:57Z)
- Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory (HTC) is a novel diagnostic framework for AI agents. HTC consistently surpasses strong baselines in both calibration and discrimination. HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z)
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping [52.02659589971978]
We propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. MoDES significantly enhances inference speed, improving prefilling time by 2.16x and decoding time by 1.26x.
arXiv Detail & Related papers (2025-11-19T18:48:27Z)
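A minimal sketch of the general expert-skipping idea, under the assumption that experts are skipped when their routing probability falls below a threshold; MoDES's actual skipping rule is not reproduced here.

```python
# Illustrative sketch of dynamic expert skipping (not MoDES's actual rule):
# drop low-probability experts at inference time and renormalize the rest.
import torch
import torch.nn.functional as F

def skip_and_renormalize(router_logits: torch.Tensor, top_k: int = 2, tau: float = 0.1):
    """router_logits: (n_tokens, n_experts). Returns top-k routing weights with
    experts below threshold tau skipped (the argmax expert is always kept)."""
    probs = F.softmax(router_logits, dim=-1)
    topv, topi = probs.topk(top_k, dim=-1)              # usual top-k routing (sorted descending)
    keep = topv >= tau                                   # skip experts below the threshold
    keep[:, 0] = True                                    # never skip the best expert
    topv = topv * keep
    topv = topv / topv.sum(dim=-1, keepdim=True)         # renormalize surviving weights
    return topv, topi, keep
```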
- Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference [2.649774320778185]
We present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. We show that DynaExq deploys large LLMs on single 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines.
arXiv Detail & Related papers (2025-11-19T01:27:54Z)
- Muon: Training and Trade-offs with Latent Attention and MoE [4.500362688166346]
We present a comprehensive theoretical and empirical study of the Muon optimizer for training small to medium decoder-only transformers (30M-200M parameters). We provide rigorous theoretical analysis, including: (i) the convergence rate under standard assumptions, (ii) spectral regularization properties that prevent gradient explosion, (iii) a connection to natural gradient descent on the Stiefel manifold, and (iv) equivalence to steepest descent under the spectral norm.
arXiv Detail & Related papers (2025-09-29T07:51:06Z)
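For orientation, a simplified sketch of a Muon-style update on a 2D weight matrix: heavy-ball momentum followed by Newton-Schulz orthogonalization. The coefficients and scaling below are the textbook cubic iteration, not necessarily the exact ones used in this study.

```python
# Simplified sketch of a Muon-style update (not the exact published coefficients):
# momentum on the gradient, then Newton-Schulz orthogonalization of the 2D update.
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix ('matrix sign') of a 2D tensor."""
    X = M / (M.norm() + 1e-7)            # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # classic cubic Newton-Schulz iteration
    return X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """param, grad, momentum are matching 2D weight/gradient/momentum matrices."""
    momentum.mul_(beta).add_(grad)                    # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum)    # spectral-norm-style preconditioning
    param.data.add_(update, alpha=-lr)
```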
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE [12.498106165046233]
Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. MoNE replaces redundant experts with lightweight novices to achieve effective and robust model compression.
arXiv Detail & Related papers (2025-07-01T03:02:59Z)
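A hypothetical sketch of the replace-redundant-experts-with-novices idea; the redundancy criterion (accumulated gate mass) and the novice design (bottleneck MLP) are our assumptions, not MoNE's actual method.

```python
# Hypothetical sketch: replace the least-used experts with much smaller "novice" modules.
# Selection criterion and novice architecture here are assumptions, not MoNE's exact method.
import torch
import torch.nn as nn

def make_novice(d_model: int, bottleneck: int = 64) -> nn.Module:
    # A far smaller expert: low-dimensional bottleneck MLP.
    return nn.Sequential(nn.Linear(d_model, bottleneck), nn.GELU(), nn.Linear(bottleneck, d_model))

def prune_with_novices(experts: nn.ModuleList, gate_mass: torch.Tensor,
                       n_replace: int, d_model: int) -> nn.ModuleList:
    """Replace the n_replace least-used experts (by accumulated gate mass) with novices."""
    redundant = gate_mass.argsort()[:n_replace].tolist()
    for idx in redundant:
        experts[idx] = make_novice(d_model)   # novices would then be briefly re-trained
    return experts
```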
- EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization [46.40666108181214]
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning, yet their inherent complexities challenge conventional quantization techniques. We propose EAQuant, a novel post-training quantization (PTQ) framework tailored for MoE architectures.
arXiv Detail & Related papers (2025-06-16T10:18:50Z)
- HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction [25.739068829471297]
We propose an adaptive bimodal prediction framework that flexibly supports single- or dual-modality inputs. The design dramatically improves H&E-only accuracy from 71.44% to 94.25%, reaches 95.09% with full dual-modality inputs, and maintains 90.28% reliability under single-modality conditions.
arXiv Detail & Related papers (2025-04-12T11:24:06Z)
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance rejecting harmful requests for safety against accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
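Since the framework builds on DPO, here is a minimal sketch of the standard DPO objective in its generic form (not this paper's specific variant).

```python
# Minimal sketch of the standard DPO objective (generic formulation).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """All inputs are summed log-probs of full responses under the policy / reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen        # implicit reward of the preferred response
    rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward of the dispreferred response
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```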
- H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs [20.071767063618548]
H3Fusion is a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift. We formulate alignment as a dual objective that harnesses the distance between generated embeddings and alignment embeddings. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and more honest.
arXiv Detail & Related papers (2024-11-26T17:42:38Z)
- Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
- Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [16.062265609569003]
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs). We propose a novel expert routing framework that incorporates: (1) an efficient routing mechanism with lightweight computation; (2) an adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) a module that determines the lower bounds of expert capacity based on dynamic token distribution analysis.
arXiv Detail & Related papers (2024-05-24T02:50:44Z)
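A speculative sketch of what bidirectional token-expert selection could look like, keeping a (token, expert) pair only when the token selects the expert and the expert selects the token; this is our reading of "resonance", not the paper's actual mechanism.

```python
# Speculative sketch of bidirectional token-expert selection (an assumption about the
# general idea of "resonance" routing, not this paper's actual mechanism).
import torch
import torch.nn.functional as F

def bidirectional_select(affinity: torch.Tensor, token_k: int = 2, expert_cap: int = 4):
    """affinity: (n_tokens, n_experts) routing scores, with n_tokens >= expert_cap.
    Keep a (token, expert) pair only if the token picks the expert (token-choice top-k)
    AND the expert picks the token (expert-choice top-cap)."""
    token_choice = torch.zeros_like(affinity).scatter_(1, affinity.topk(token_k, dim=1).indices, 1.0)
    expert_choice = torch.zeros_like(affinity).scatter_(0, affinity.topk(expert_cap, dim=0).indices, 1.0)
    mask = token_choice * expert_choice                  # 1.0 where both directions agree
    weights = F.softmax(affinity, dim=1) * mask          # zero out non-resonant pairs
    return weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-9)
```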
- Mitigating the Alignment Tax of RLHF [76.4300447532456]
Aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained abilities, also known as the alignment tax. We propose model averaging (HMA) to maximize alignment performance while incurring minimal alignment tax. We validate HMA's performance across a range of RLHF algorithms on OpenLLaMA-3B and further extend our findings to Mistral-7B.
arXiv Detail & Related papers (2023-09-12T14:16:54Z)
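Model averaging here reduces, in its simplest form, to interpolating the weights of the pre-RLHF (SFT) model and the RLHF-aligned model; the sketch below shows only that basic interpolation, not HMA's exact scheme.

```python
# Minimal sketch of weight averaging between a pre-RLHF (SFT) model and an RLHF-aligned
# model; only the basic interpolation is shown, not HMA's specific averaging scheme.
import torch

@torch.no_grad()
def average_models(sft_model, rlhf_model, alpha: float = 0.5):
    """Interpolate parameters in place on rlhf_model: theta = (1 - alpha) * sft + alpha * rlhf."""
    sft_params = dict(sft_model.named_parameters())
    for name, p in rlhf_model.named_parameters():
        p.mul_(alpha).add_(sft_params[name], alpha=1.0 - alpha)
    return rlhf_model
```

Sweeping alpha trades off alignment gains against recovery of forgotten pretrained abilities.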
- Maximum Entropy Heterogeneous-Agent Reinforcement Learning [45.377385280485065]
Multi-agent reinforcement learning (MARL) has been shown effective for cooperative games in recent years. We propose a unified framework for learning policies that resolves issues related to sample complexity, training instability, and the risk of converging to a suboptimal Nash equilibrium. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game.
arXiv Detail & Related papers (2023-06-19T06:22:02Z)
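As background, the maximum-entropy objective underlying soft actor-critic methods can be sketched for a single agent as follows; HASAC's heterogeneous-agent sequential-update machinery is omitted.

```python
# Single-agent sketch of the maximum-entropy (soft) actor objective that SAC-style methods
# build on; the heterogeneous-agent extensions used by HASAC are not shown.
import torch

def soft_actor_loss(log_prob: torch.Tensor, q_value: torch.Tensor, alpha: float = 0.2):
    """MaxEnt policy loss: E[alpha * log pi(a|s) - Q(s, a)], minimized w.r.t. the policy."""
    return (alpha * log_prob - q_value).mean()
```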