Post-Completion Learning for Language Models
- URL: http://arxiv.org/abs/2507.20252v2
- Date: Tue, 05 Aug 2025 03:29:38 GMT
- Title: Post-Completion Learning for Language Models
- Authors: Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Chao Feng, Can Huang
- Abstract summary: Current language model training paradigms terminate learning upon reaching the end-of-sequence (<eos>) token. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point.
- Score: 20.589364712188015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model evaluates its own output according to the reward rules, and the resulting score is then calculated and aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.
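The abstract specifies the training signal but not the data layout, so the following is a minimal sketch, under my own assumptions, of how a PCL-style training example and its dual-track loss masks might be assembled. Every name here (PCLExample, build_pcl_example, the <think>/<eval>/<reward:...> tags) is hypothetical rather than the authors' implementation.

```python
# Minimal sketch of a PCL-style training example, using a whitespace-level
# "tokenizer" for readability. All tags and names are illustrative guesses.

from dataclasses import dataclass

EOS = "<eos>"

@dataclass
class PCLExample:
    tokens: list[str]           # full training sequence, incl. post-completion text
    completion_mask: list[int]  # 1 on the reasoning/answer span (first SFT track)
    post_mask: list[int]        # 1 on the self-evaluation span (second SFT track)
    reward_target: float        # rule-based score the predicted reward is aligned to

def build_pcl_example(prompt, reasoning, answer, self_eval, reward_target):
    completion = reasoning + answer + [EOS]                # generated at inference
    post = self_eval + [f"<reward:{reward_target:.2f}>"]   # seen only during training

    tokens = prompt + completion + post
    completion_mask = [0] * len(prompt) + [1] * len(completion) + [0] * len(post)
    post_mask = [0] * (len(prompt) + len(completion)) + [1] * len(post)
    return PCLExample(tokens, completion_mask, post_mask, reward_target)

example = build_pcl_example(
    prompt=["Q:", "What", "is", "2+2?"],
    reasoning=["<think>", "2", "plus", "2", "equals", "4.", "</think>"],
    answer=["A:", "4"],
    self_eval=["<eval>", "answer", "correct;", "reasoning", "complete.", "</eval>"],
    reward_target=1.0,
)
print(example.tokens)
print(example.completion_mask)
print(example.post_mask)
# A dual-track SFT loss would weight cross-entropy over completion_mask and
# post_mask separately, and an RL term would align the score parsed from the
# <reward:...> tag with the rule-based reward.
```

The masks are the point of the sketch: the post-completion span receives supervision during training but is never generated at deployment (generation stops at <eos>), which is how PCL keeps inference cost unchanged.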
Related papers
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Reinforcement learning fine-tuning of language model for instruction following and math reasoning [0.0]
This study investigates the effectiveness of reinforcement learning techniques on a compact language model (Qwen2.5-0.5B Base). We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results.
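As a reference point for what is being compared, here is a minimal sketch of the standard DPO objective (the loss in its generic published form, not this paper's code); the four tensors stand in for per-response summed log-probabilities under the policy and the frozen reference model.

```python
# Sketch of the standard DPO loss on a batch of preference pairs.
# The tensors below are random placeholders for real model log-probabilities.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between the chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
print(loss.item())
```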
arXiv Detail & Related papers (2025-06-11T22:49:42Z) - Estimating the Effects of Sample Training Orders for Large Language Models without Retraining [49.59675538160363]
The order of training samples plays a crucial role in large language models (LLMs). Traditional methods for investigating this effect generally require retraining the model with various sample orders. We improve traditional methods by designing a retraining-free framework.
arXiv Detail & Related papers (2025-05-28T07:07:02Z) - Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models [4.737806982257592]
This study proposes a knowledge distillation algorithm based on large language models and feature alignment. The proposed model performs very close to the state-of-the-art GPT-4 model in terms of evaluation indicators such as perplexity, BLEU, ROUGE, and CER.
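The summary names the ingredients but not the loss, so here is a generic, hedged sketch of what a feature-alignment distillation objective commonly looks like: a temperature-scaled KL term on logits plus an MSE between projected student features and teacher features. Dimensions, weights, and names below are illustrative assumptions, not the paper's implementation.

```python
# Generic feature-alignment distillation loss: soft-label KL + projected-feature MSE.

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits,
                 student_feats, teacher_feats, proj,
                 T=2.0, alpha=0.5):
    # Soft-label term: KL between temperature-scaled student and teacher distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Feature-alignment term: map student hidden states into the teacher's
    # hidden size and penalize their distance.
    feat = F.mse_loss(proj(student_feats), teacher_feats)
    return alpha * kl + (1 - alpha) * feat

# Toy shapes: batch of 4, vocab 100, student width 64, teacher width 128.
proj = torch.nn.Linear(64, 128)
loss = distill_loss(torch.randn(4, 100), torch.randn(4, 100),
                    torch.randn(4, 64), torch.randn(4, 128), proj)
loss.backward()
```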
arXiv Detail & Related papers (2024-12-27T04:37:06Z) - Joint Training for Selective Prediction [5.662924503089369]
Selective Prediction methods determine when to adopt a classifier's output versus defer to a human.
One previous method involves learning a deferral model based on engineered features.
We introduce a novel joint-training approach that simultaneously optimizes the learned representations used by the classifier module and a learned deferral policy.
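A minimal sketch of one way such joint training can be set up, under assumptions of my own rather than the paper's architecture: a shared encoder feeds both a classification head and a deferral head, and a single combined loss trains the representation for both objectives at once.

```python
# Hypothetical joint training of a classifier and a deferral policy over a
# shared representation. The deferral head learns to predict when the
# classifier is wrong; architecture and loss weights are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveNet(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_classes)
        self.deferral = nn.Linear(hidden, 1)  # logit of "defer to human"

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.deferral(h).squeeze(-1)

def joint_loss(logits, defer_logit, y, defer_weight=0.5):
    cls_loss = F.cross_entropy(logits, y)
    # Deferral target: 1 if the classifier's current prediction is wrong.
    wrong = (logits.argmax(-1) != y).float()
    defer_loss = F.binary_cross_entropy_with_logits(defer_logit, wrong)
    return cls_loss + defer_weight * defer_loss

model = SelectiveNet()
x, y = torch.randn(16, 32), torch.randint(0, 3, (16,))
loss = joint_loss(*model(x), y)
loss.backward()  # gradients reach the shared encoder from both heads
```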
arXiv Detail & Related papers (2024-10-31T15:28:26Z) - In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models [37.45103473809928]
We propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model.
By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data.
arXiv Detail & Related papers (2024-08-07T05:48:05Z) - DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task with a few "task demonstrations" without updating their parameters. We propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL. We experimentally prove the wide applicability of DETAIL by showing our attribution scores obtained on white-box models are transferable to black-box models in improving model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z) - An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
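As I read the up-scaling special case, the key operation is next-token log-probability arithmetic: take the large base model's distribution and add the behavioral delta between the small fine-tuned and small base models. The sketch below uses random placeholder log-probabilities instead of real model outputs and is my interpretation, not the authors' code.

```python
# Toy sketch of emulated fine-tuning as log-probability arithmetic.

import numpy as np

def eft_next_token_logprobs(logp_base_large, logp_ft_small, logp_base_small):
    """Combine: large base + (small fine-tuned - small base), renormalized in log space."""
    combined = logp_base_large + (logp_ft_small - logp_base_small)
    return combined - np.logaddexp.reduce(combined)

# Toy vocabulary of 5 tokens with made-up log-probabilities for each model.
rng = np.random.default_rng(0)
def fake_logprobs(size=5):
    logits = rng.normal(size=size)
    return logits - np.logaddexp.reduce(logits)

logp = eft_next_token_logprobs(fake_logprobs(), fake_logprobs(), fake_logprobs())
next_token = int(np.argmax(logp))  # greedy pick; sampling from np.exp(logp) also works
print(np.round(np.exp(logp), 3), next_token)
```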
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [65.57123249246358]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT. On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt. On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z) - Temporal Difference Learning for Model Predictive Control [29.217382374051347]
Data-driven model predictive control has two key advantages over model-free methods: improved sample efficiency through model learning, and better performance as the computational budget for planning increases.
TD-MPC achieves superior sample efficiency and performance over prior work on both state and image-based continuous control tasks.
arXiv Detail & Related papers (2022-03-09T18:58:28Z)