When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
- URL: http://arxiv.org/abs/2512.06343v1
- Date: Sat, 06 Dec 2025 08:15:37 GMT
- Title: When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
- Authors: Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-jui Hsieh
- Abstract summary: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of a pair of chosen and rejected responses. We propose NormBT, an adaptive pair-wise normalization scheme that balances representation-driven effects and focuses learning signals on prediction error.
- Score: 55.444604697848426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of a pair of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and show that its norm scales with two distinct components: (1) the difference in predicted rewards between chosen and rejected responses, which reflects the prediction error, and critically, (2) the representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, we show that the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that balances representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in integration to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous small-distance pairs. This work reveals a key limitation in the widely used BT objective and provides a simple, effective correction.
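The gradient decomposition the abstract describes is easy to see for a linear reward head. A minimal numeric sketch: with BT loss L = -log sigmoid(w·h_c - w·h_r), the per-sample gradient factors into a prediction-error term and the representation distance ||h_c - h_r||, so two equally misranked pairs receive very different update magnitudes. The `normbt_grad` normalization below is a hypothetical instantiation in the spirit of NormBT (the abstract does not specify the exact scheme); it simply divides out the pair's representation distance.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_grad(w, h_c, h_r):
    """Per-sample gradient of the BT loss L = -log sigmoid(w @ h_c - w @ h_r)
    with respect to a final linear reward head w.  Its norm factors into a
    prediction-error term sigmoid(-margin) and the distance ||h_c - h_r||."""
    margin = w @ h_c - w @ h_r
    err = sigmoid(-margin)        # in (0, 1); 0.5 at the decision boundary
    return -err * (h_c - h_r)

def normbt_grad(w, h_c, h_r, eps=1e-8):
    """Hypothetical pair-wise normalization (one plausible reading of the
    NormBT idea): divide out the representation distance so the update
    magnitude depends only on the prediction error."""
    return bt_grad(w, h_c, h_r) / (np.linalg.norm(h_c - h_r) + eps)

rng = np.random.default_rng(0)
w = rng.normal(size=8)
h = rng.normal(size=8)

# Build a direction orthogonal to w, so both pairs sit at margin ~ 0
# (equally "misranked"); only their representation distance differs.
v = rng.normal(size=8)
d = v - (v @ w) / (w @ w) * w
d /= np.linalg.norm(d)

for name, s in (("small-distance", 0.01), ("large-distance", 1.0)):
    h_c, h_r = h + s * d, h - s * d
    print(name,
          np.linalg.norm(bt_grad(w, h_c, h_r)),      # scales with distance
          np.linalg.norm(normbt_grad(w, h_c, h_r)))  # distance divided out
```

With the raw BT gradient, the large-distance pair dominates by roughly the ratio of the two distances; after the distance is normalized away, both pairs receive updates of comparable magnitude.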
Related papers
- Hard Negative Sample-Augmented DPO Post-Training for Small Language Models [4.425580048633862]
We propose a lightweight and pragmatic post-training pipeline that targets structured errors under realistic compute budgets. We introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. Experiments show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO.
arXiv Detail & Related papers (2025-12-17T06:15:52Z) - Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection [85.0189917888094]
We propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework to handle the challenges posed by subtle and infrequent mistakes. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances.
arXiv Detail & Related papers (2025-09-16T12:00:42Z) - Gradient Extrapolation for Debiased Representation Learning [7.183424522250937]
Gradient Extrapolation for Debiased Representation Learning (GERNE) is designed to learn debiased representations in both known and unknown attribute training cases. Our analysis shows that when the extrapolated gradient points toward the batch gradient with fewer spurious correlations, it effectively guides training toward learning a debiased model.
arXiv Detail & Related papers (2025-03-17T14:48:57Z) - UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning [57.081646768835704]
User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. We propose UPCORE, a method-agnostic data selection framework for mitigating collateral damage during unlearning.
arXiv Detail & Related papers (2025-02-20T22:51:10Z) - Binary Classifier Optimization for Large Language Model Alignment [4.61411484523337]
In real-world services such as ChatGPT, aligning models based on user feedback is crucial for improving performance. Most existing alignment research relies on preference-based approaches that require both positive and negative responses as a pair. We propose Binary Classifier Optimization (BCO), a technique that effectively aligns LLMs using only binary feedback.
arXiv Detail & Related papers (2024-04-06T15:20:59Z) - Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits approach (LORT) without the requirement of prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z) - Guarding Barlow Twins Against Overfitting with Mixed Samples [27.7244906436942]
Self-supervised learning aims to learn transferable feature representations for downstream applications without relying on labeled data.
We introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples.
arXiv Detail & Related papers (2023-12-04T18:59:36Z) - Prompt Tuning Pushes Farther, Contrastive Learning Pulls Closer: A Two-Stage Approach to Mitigate Social Biases [13.837927115198308]
We propose an adversarial training-inspired two-stage debiasing model using Contrastive learning and Continuous Prompt Augmentation.
Our approach guides the model to achieve stronger debiasing performance by adding difficulty to the training process.
arXiv Detail & Related papers (2023-07-04T09:35:03Z) - Learning Compact Features via In-Training Representation Alignment [19.273120635948363]
In each epoch, the true gradient of the loss function is estimated using a mini-batch sampled from the training set.
We propose In-Training Representation Alignment (ITRA) that explicitly aligns feature distributions of two different mini-batches with a matching loss.
We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning.
arXiv Detail & Related papers (2022-11-23T22:23:22Z) - Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z) - You Only Need End-to-End Training for Long-Tailed Recognition [8.789819609485225]
Cross-entropy loss tends to produce highly correlated features on imbalanced data.
We propose two novel modules, Block-based Relatively Balanced Batch Sampler (B3RS) and Batch Embedded Training (BET).
Experimental results on the long-tailed classification benchmarks, CIFAR-LT and ImageNet-LT, demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2021-12-11T11:44:09Z) - Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets, Adult and Communities and Crime, and also on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, training with the F-measure objective improves micro-F1 scores, with absolute gains of up to 8% compared to models trained with the cross-entropy loss function.
arXiv Detail & Related papers (2020-08-08T03:02:27Z)
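The F-measure entry above rests on a standard trick: the F-measure itself is non-differentiable because it counts hard decisions, but replacing the counts with their expectations under the predicted probabilities yields a smooth surrogate. A minimal sketch of that general idea (the paper's exact approximation may differ):

```python
import numpy as np

def soft_f1_loss(probs, labels, eps=1e-8):
    """Differentiable F1 surrogate: replace hard TP/FP/FN counts with their
    expectations under the predicted probabilities, then minimize 1 - F1."""
    tp = np.sum(probs * labels)            # expected true positives
    fp = np.sum(probs * (1.0 - labels))    # expected false positives
    fn = np.sum((1.0 - probs) * labels)    # expected false negatives
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1

y = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.8, 0.2])   # confident, mostly correct
bad = np.array([0.2, 0.8, 0.3, 0.7])    # mostly wrong
print(soft_f1_loss(good, y), soft_f1_loss(bad, y))
```

Because every operation is smooth in `probs`, the surrogate can be minimized with standard backpropagation, and the loss decreases as predictions move toward the labels.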
This list is automatically generated from the titles and abstracts of the papers in this site.