Related papers: RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models

RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models

URL: http://arxiv.org/abs/2506.02726v1
Date: Tue, 03 Jun 2025 10:36:38 GMT
Title: RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models
Authors: Qihang Yan, Xinyu Zhang, Luming Guo, Qi Zhang, Feifan Liu,
Abstract summary: This paper introduces RACE-Align, a novel framework designed to address the limitations of traditional preference alignment methods.<n> RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning.<n> Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model.
Score: 11.107932406541865
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align significantly outperforms the original base model and a model fine-tuned only with Supervised Fine-Tuning (SFT). Improvements were observed across multiple dimensions, including answer accuracy, information richness, application of TCM thinking patterns, logicality and depth of reasoning, and interpretability. These findings suggest RACE-Align offers an effective pathway to enhance LLMs' knowledge application, reasoning reliability, and process transparency in complex vertical domains.

Related papers

From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models [0.7366405857677227]
This survey provides a textittheoretical unification of preference learning methods.<n>We formalize each axis with precise definitions and theorems.<n>We synthesize empirical findings across 50+ papers and provide a practitioner's decision guide for method selection.
arXiv Detail & Related papers (2026-01-03T08:33:26Z)
Multi-Objective Reward and Preference Optimization: Theory and Algorithms [3.316593788543852]
This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models.<n>ACPO, e-COP, warmPref-PS, PSPL, and MOPO advance RL across average-cost, episodic, and preference-driven paradigms.<n> Collectively, the thesis unifies RL across average-cost, episodic, and preference-driven paradigms, delivering theoretical advances and practical tools for safe and aligned decision-making.
arXiv Detail & Related papers (2025-12-11T12:51:21Z)
Deep Unfolding: Recent Developments, Theory, and Design Guidelines [99.63555420898554]
This article provides a tutorial-style overview of deep unfolding, a framework that transforms optimization algorithms into structured, trainable ML architectures.<n>We review the foundations of optimization for inference and for learning, introduce four representative design paradigms for deep unfolding, and discuss the distinctive training schemes that arise from their iterative nature.
arXiv Detail & Related papers (2025-12-03T13:16:35Z)
TopoSizing: An LLM-aided Framework of Topology-based Understanding and Sizing for AMS Circuits [7.615431299673158]
Traditional black-box optimization achieves sampling efficiency but lacks circuit understanding.<n>We propose TopoSizing, an end-to-end framework that performs robust circuit understanding directly from raw netlists.
arXiv Detail & Related papers (2025-09-17T16:52:46Z)
Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR)<n>We present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle.<n>We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
A Technical Survey of Reinforcement Learning Techniques for Large Language Models [33.38582292895673]
Reinforcement Learning (RL) has emerged as a transformative approach for aligning and enhancing Large Language Models (LLMs)<n>RLHF remains dominant for alignment, and outcome-based RL such as RLVR significantly improves stepwise reasoning.<n> persistent challenges such as reward hacking, computational costs, and scalable feedback collection underscore the need for continued innovation.
arXiv Detail & Related papers (2025-07-05T19:13:00Z)
RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1 [20.92548890511589]
This paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs)<n> RecLLM-R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty.
arXiv Detail & Related papers (2025-06-24T01:39:34Z)
Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes [54.93980123979578]
We introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences.<n>LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data.
arXiv Detail & Related papers (2025-05-08T06:59:06Z)
Training Large Language Models to Reason via EM Policy Gradient [0.27195102129094995]
We introduce an off-policy reinforcement learning algorithm, EM Policy Gradient, to enhance LLM reasoning.<n>We evaluate the effectiveness of EM Policy Gradient on the GSM8K and MATH (HARD) datasets.<n>Models fine-tuned with our method exhibit cognitive behaviors, such as sub-problem decomposition, self-verification, and backtracking.
arXiv Detail & Related papers (2025-04-24T01:31:05Z)
A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.<n>Their alignment with human values remains critical for ensuring helpful and harmless deployments.<n>Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration.<n>These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance.<n>This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Integration and Adaptation, which
arXiv Detail & Related papers (2025-03-08T05:41:42Z)
Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment [45.45508377432791]
This paper introduces Reward-Aware Preference Optimization (RPO), a mathematical framework that unifies popular preference optimization techniques.<n>RPO provides a structured approach to disentangle and systematically study the impact of various design choices.<n>We propose a new experimental setup that enables the clean and direct ablation of such design choices.
arXiv Detail & Related papers (2025-01-31T22:39:04Z)
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world. It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.