RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
- URL: http://arxiv.org/abs/2507.22219v1
- Date: Tue, 29 Jul 2025 20:35:35 GMT
- Title: RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
- Authors: Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
- Abstract summary: Reinforcement Learning from Teacher-Model Refinement (RLfR) is a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines.
- Score: 31.28415780479141
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference-learning methods for machine translation (MT)--such as Direct Preference Optimization (DPO)--have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher's refinement. Guided by two complementary signals--(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy--the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
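The reward described in the abstract can be made concrete with a short sketch. This is a minimal illustration, not the paper's implementation: the token-level edit distance, the length normalization, and the equal weighting of the two signals are assumptions, and `comet_score` is a hypothetical callable standing in for an external COMET scorer (here scored against the teacher's refinement as reference).

```python
from typing import Callable, List


def levenshtein(a: List[str], b: List[str]) -> int:
    """Token-level edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rlfr_reward(
    source: str,
    hypothesis: str,
    refinement: str,
    comet_score: Callable[[str, str, str], float],
    w_edit: float = 0.5,
    w_comet: float = 0.5,
) -> float:
    """Reward for one micro-tutorial step.

    (i) Negative edit distance between the actor's hypothesis and the
        teacher's refinement, length-normalized to [-1, 0], rewards
        lexical and structural fidelity.
    (ii) A COMET-style adequacy score of the hypothesis, scored against
        the refinement as reference, rewards semantic adequacy.
    """
    hyp_toks, ref_toks = hypothesis.split(), refinement.split()
    denom = max(len(hyp_toks), len(ref_toks), 1)
    neg_edit = -levenshtein(hyp_toks, ref_toks) / denom
    adequacy = comet_score(source, hypothesis, refinement)
    return w_edit * neg_edit + w_comet * adequacy
```

In the full loop, the actor samples `hypothesis`, the teacher model (GPT-4o) edits it into `refinement`, and the scalar reward above drives an on-policy update of the actor; the abstract does not specify the optimizer, so any standard policy-gradient method (e.g., PPO) could fill that role in this sketch.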
Related papers
- RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation [33.79108789619648]
Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback with translation tasks has shown great potential. We observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. We propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the reward model (RM) and the LLM.
arXiv Detail & Related papers (2025-06-05T14:18:21Z)
- Towards Better Instruction Following Retrieval Models [30.99867106106421]
We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive <instruction, query, passage> triplets as positive samples. We generate two additional hard negative examples by poisoning both instructions and queries, which are then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness.
arXiv Detail & Related papers (2025-05-27T17:14:37Z)
- T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
- Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models [22.613040767122225]
We propose a Preference-Aligned Distillation (PAD) framework, which models the teacher's preference knowledge as a probability distribution over all potential preferences. Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-02-20T05:18:23Z)
- SEFL: Enhancing Educational Assignment Feedback with LLM Agents [5.191286314473505]
Synthetic Educational Feedback Loops (SEFL) is a synthetic data framework designed to generate data that resembles immediate, on-demand feedback at scale. To produce this data, two large language models (LLMs) operate in teacher-student roles to simulate assignment completion and formative feedback. We show that SEFL-tuned models outperform both their non-tuned counterparts and an existing baseline in feedback quality.
arXiv Detail & Related papers (2025-02-18T15:09:29Z)
- Imitating Language via Scalable Inverse Reinforcement Learning [34.161807103808016]
We investigate the inverse reinforcement learning (IRL) perspective on imitation. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance.
arXiv Detail & Related papers (2024-09-02T16:48:57Z)
- LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
"LLMs-as-Instructors" framework autonomously enhances the training of smaller target models.
Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model.
Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z)
- TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
arXiv Detail & Related papers (2024-06-12T17:21:21Z)
- A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods (the DPO objective is sketched after this list).
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models [55.137869702763375]
This paper explores a novel pre-trained language model (PLM) reuse paradigm, Knowledge Integration (KI).
KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student.
arXiv Detail & Related papers (2022-10-11T07:59:08Z)
- Self-Paced Learning for Neural Machine Translation [55.41314278859938]
We propose self-paced learning for neural machine translation (NMT) training.
We show that the proposed model yields better performance than strong baselines.
arXiv Detail & Related papers (2020-10-09T11:33:16Z)
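For reference, the objective of Direct Preference Optimization (the preference-learning baseline named in the abstract and summarized in the DPO entry above) can be written as follows, with tuned policy $\pi_\theta$, frozen reference policy $\pi_{\mathrm{ref}}$, preference triplets $(x, y_w, y_l)$ drawn from a dataset $\mathcal{D}$, and a scaling parameter $\beta$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

The static $(x, y_w, y_l)$ triplets that this loss requires are exactly the curated preference data that RLfR replaces with on-policy teacher refinements.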