RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation
- URL: http://arxiv.org/abs/2506.05070v1
- Date: Thu, 05 Jun 2025 14:18:21 GMT
- Title: RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation
- Authors: Tianjiao Li, Mengran Yu, Chenyu Shi, Yanjun Zhao, Xiaojing Liu, Qiang Zhang, Qi Zhang, Xuanjing Huang, Jiayin Wang
- Abstract summary: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback with translation tasks has shown great potential. We observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. We propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM.
- Score: 33.79108789619648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (a qualitative preference reward) and the LLM trained to improve its translations to close this gap. To stabilize training and improve generalizability, we also incorporate a quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Extensive experiments demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.
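The abstract describes an alternating min-max scheme: the RM learns to separate strong from weak translations (anchored by a BLEU-style quantitative signal), while the LLM learns to close the gap the RM exposes. The sketch below is a toy, self-contained illustration of that alternation under stated assumptions, not the authors' implementation: the reward model is a hand-rolled linear scorer, the "LLM step" is approximated by best-of-n selection under the current RM, and the BLEU proxy is plain unigram overlap. All names (ToyRM, bleu_proxy, llm_generate, rival_round) are hypothetical.

```python
"""Toy sketch of a RIVAL-style iterative adversarial loop (illustrative only)."""
import math
import random

random.seed(0)


def bleu_proxy(hyp: str, ref: str) -> float:
    """Unigram-overlap stand-in for BLEU (the quantitative preference reward)."""
    h, r = hyp.split(), ref.split()
    return sum(tok in r for tok in h) / max(len(h), 1)


class ToyRM:
    """Linear reward model over one toy feature, trained with a pairwise
    (qualitative) preference loss plus a regression term toward the BLEU proxy."""

    def __init__(self) -> None:
        self.w, self.b = 0.0, 0.0

    def feature(self, hyp: str) -> float:
        toks = hyp.split()
        return len(set(toks)) / max(len(toks), 1)  # lexical-diversity stand-in

    def score(self, hyp: str) -> float:
        return self.w * self.feature(hyp) + self.b

    def update(self, strong: str, weak: str, ref: str,
               lr: float = 0.1, lam: float = 0.5) -> None:
        # Qualitative preference: Bradley-Terry-style push of score(strong) above score(weak).
        margin = self.score(strong) - self.score(weak)
        grad = -1.0 / (1.0 + math.exp(margin))  # d(-log sigmoid(margin))/d(margin)
        self.w -= lr * grad * (self.feature(strong) - self.feature(weak))
        # Quantitative preference: regress score(strong) toward its BLEU proxy.
        err = self.score(strong) - bleu_proxy(strong, ref)
        self.w -= lr * lam * err * self.feature(strong)
        self.b -= lr * lam * err


def llm_generate(src: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n candidate translations from the LLM policy."""
    words = src.split()
    return [" ".join(random.sample(words, random.randint(1, len(words))))
            for _ in range(n)]


def rival_round(rm: ToyRM, src: str, ref: str, weak: str) -> str:
    # "LLM step": pick the candidate the current RM prefers
    # (best-of-n as a crude proxy for a policy-gradient update against the RM).
    strong = max(llm_generate(src), key=rm.score)
    # "RM step": re-train the RM to separate the new strong translation from
    # the previous weak one, anchored by the quantitative signal.
    rm.update(strong, weak, ref)
    return strong


if __name__ == "__main__":
    rm = ToyRM()
    src = ref = "the quick brown fox jumps over the lazy dog"
    weak = "fox dog"
    for t in range(5):
        weak = rival_round(rm, src, ref, weak)  # last round's output becomes the new "weak"
        print(f"round {t}: hypothesis={weak!r}  RM score={rm.score(weak):.3f}")
```

In the paper's actual setting, both the RM and the LLM are neural models, and the LLM step would be an RLHF-style policy update against the current RM rather than best-of-n selection; the sketch only shows how the alternation and the combined qualitative/quantitative reward fit together.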
Related papers
- RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation [31.28415780479141]
Reinforcement Learning from Teacher-Model Refinement (RLfR) is a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines.
arXiv Detail & Related papers (2025-07-29T20:35:35Z) - MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models [95.6332110724999]
Motivation-enhanced Reinforcement Finetuning (MeRF) is an intuitive yet effective method for enhancing reinforcement learning of Large Language Models (LLMs). MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to improve its responses with awareness of the optimization objective. Empirical evaluations on the Knights and Knaves (K&K) logic puzzle reasoning benchmark demonstrate that MeRF achieves substantial performance gains over baselines.
arXiv Detail & Related papers (2025-06-23T10:37:57Z) - ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning [77.41383117199227]
We design a new reward modeling method that compares the translation results of the policy MT model with a strong LRM. Using Qwen2.5-7B-Instruct as the backbone, the trained model achieves new state-of-the-art performance in literary translation. We extend our method to multilingual settings with 11 languages.
arXiv Detail & Related papers (2025-05-19T11:34:47Z) - Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings [25.851419860597407]
We propose a novel approach that leverages fine-grained, token-level quality assessments along with error severity levels using reinforcement learning. We conduct experiments on small and large translation datasets with both standard encoder-decoder and LLM-based machine translation systems. Our results show that training with token-level rewards improves translation quality across language pairs over baselines, according to both automatic and human evaluation.
arXiv Detail & Related papers (2024-11-08T21:55:37Z) - Cross-lingual Transfer of Reward Models in Multilingual Alignment [8.13893128694698]
Reinforcement learning with human feedback (RLHF) is shown to benefit largely from precise reward models (RMs). Recent studies in reward modeling are skewed towards English, limiting the applicability of RLHF to multilingual alignment. We investigate the cross-lingual transfer of RMs trained in diverse languages, primarily from English.
arXiv Detail & Related papers (2024-10-23T17:00:13Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs)
MIR is informative for training data selection, training strategy scheduling, and model architecture design, guiding better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Imitating Language via Scalable Inverse Reinforcement Learning [34.161807103808016]
We focus on the inverse reinforcement learning (IRL) perspective on imitation. We find clear advantages for IRL-based imitation, in particular for retaining diversity while maximizing task performance.
arXiv Detail & Related papers (2024-09-02T16:48:57Z) - Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze the representation space, generated responses, and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement [26.26493253161022]
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT).
We introduce a systematic LLM-based self-refinement translation framework, named TEaR.
arXiv Detail & Related papers (2024-02-26T07:58:12Z) - Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution [57.42593422091653]
We explore leveraging reinforcement learning with human feedback to improve translation quality.
A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality.
arXiv Detail & Related papers (2024-02-18T09:51:49Z) - Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of LRFs as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.