Related papers: ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

URL: http://arxiv.org/abs/2407.00087v2
Date: Thu, 03 Oct 2024 18:06:53 GMT
Title: ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Authors: Ju-Seung Byun, Jiyun Chun, Jihyung Kil, Andrew Perrault,
Abstract summary: We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT) First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT) Second, we ask the Teacher to correct the wrong reasoning after the RL stage. With the correction feedback, we stabilize the RL fine-tuned model through SFT.
Score: 13.154512864498912
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Multimodal Models (LMMs) excel at comprehending human instructions and demonstrate remarkable results across a broad spectrum of tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) further refine LLMs by aligning them with specific preferences. These methods primarily use ranking-based feedback for entire generations. With advanced AI models (Teacher), such as GPT-4 and Claude 3 Opus, we can request various types of detailed feedback that are expensive for humans to provide. We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT). This sentence-level feedback allows us to consider individual valuable segments, providing more granular rewards for the RL procedure. Second, we ask the Teacher to correct the wrong reasoning after the RL stage. The RL procedure requires massive efforts for hyperparameter tuning and often generates errors like repetitive words and incomplete sentences. With the correction feedback, we stabilize the RL fine-tuned model through SFT. We conduct experiments on multi-model dataset ScienceQA and A-OKVQA to demonstrate the effectiveness of our proposal. ARES rationale reasoning achieves around 70% win rate against baseline models judged by GPT-4o. Additionally, we observe that the improved rationale reasoning leads to a 2.5% increase in inference answer accuracy on average for the multi-modal datasets.

Related papers

Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We useReinforcement Learning from Human Feedback to train language models (LMs) to follow complex human preferences.<n>As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM)<n>We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples.
arXiv Detail & Related papers (2025-07-21T11:19:04Z)
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following [55.60192044049083]
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs)<n>We propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model.<n>We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks.
arXiv Detail & Related papers (2025-06-11T17:10:36Z)
Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity.<n>Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density.<n>We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
Interleaved Reasoning for Large Language Models via Reinforcement Learning [22.403928213802036]
Long chain-of-thought (CoT) enhances large language models' (LLM) reasoning capabilities.<n>We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions.
arXiv Detail & Related papers (2025-05-26T07:58:17Z)
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models.<n>We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning [66.43194385702297]
Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL) We propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks.
arXiv Detail & Related papers (2025-04-15T21:37:13Z)
Dual Active Learning for Reinforcement Learning from Human Feedback [13.732678966515781]
Reinforcement learning from human feedback (RLHF) is widely applied to align large language models with human preferences. Human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label. In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem.
arXiv Detail & Related papers (2024-10-03T14:09:58Z)
Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns Large language models (LLMs) with human intentions and values. Despite its effectiveness and popularity, RLHF is prone to biased local optimization. We propose a novel textitsequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z)
Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language and Vision-Language Models with human preferences. To alleviate some of the computational burden of fine-tuning, efficient methods, like LoRA were introduced. We benchmark the PE-RLHF setup on six diverse datasets spanning summarization, harmless/helpful response generation, UI automation, and visual question answering.
arXiv Detail & Related papers (2024-03-15T21:43:46Z)
Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (textbfRLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z)
A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z)
ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback. A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores. Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
arXiv Detail & Related papers (2024-02-11T22:40:12Z)
Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMM) are built across modalities and misalignment between two modalities can result in "hallucination" We adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information. Our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z)
Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms. Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward. Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks. Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL. Based on such an approach, we propose two frameworks namely Self-Supervised Q-learning(SQN) and Self-Supervised Actor-Critic(SAC)
arXiv Detail & Related papers (2020-06-10T11:18:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.