Related papers: Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

URL: http://arxiv.org/abs/2505.16483v1
Date: Thu, 22 May 2025 10:10:07 GMT
Title: Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Authors: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun,
Abstract summary: We propose a systematic framework, CANOE, to improve the faithfulness of large language models (LLMs) in both short-form and long-form generation tasks without human annotations.<n>Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data.<n> Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs.
Score: 80.27561080938747
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.

Related papers

Aligning Large Language Models via Fully Self-Synthetic Data [20.05693955243206]
Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets.<n>In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment.<n>Experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval2.0.
arXiv Detail & Related papers (2025-10-08T05:07:45Z)
A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 instruction unique-following prompts.<n>With our synthetic prompts, we use two preference dataset curation methods - rejection sampling (RS) and Monte Carlo Tree Search (MCTS)<n>Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.<n>High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation [12.736045604858738]
Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data.<n>We propose a paradigm shift named textbfNOMAD by investigating how to specifically train models for data generation.
arXiv Detail & Related papers (2024-10-27T07:38:39Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Self-Boosting Large Language Models with Synthetic Preference Data [97.94185115047999]
We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities. SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
arXiv Detail & Related papers (2024-10-09T14:57:31Z)
Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.<n>The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.<n>We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z)
Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language.<n>As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values.<n>This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO)<n>By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
Instructed Language Models with Retrievers Are Powerful Entity Linkers [87.16283281290053]
Instructed Generative Entity Linker (INSGENEL) is the first approach that enables casual language models to perform entity linking over knowledge bases. INSGENEL outperforms previous generative alternatives with +6.8 F1 points gain on average.
arXiv Detail & Related papers (2023-11-06T16:38:51Z)
SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision. We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
Enabling Language Models to Implicitly Learn Self-Improvement [49.16868302881804]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. We propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data.
arXiv Detail & Related papers (2023-10-02T04:29:40Z)
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions. This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision. We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
arXiv Detail & Related papers (2023-05-04T17:59:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.