VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers
- URL: http://arxiv.org/abs/2410.08048v1
- Date: Thu, 10 Oct 2024 15:43:55 GMT
- Title: VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers
- Authors: Jianing Qi, Hao Tang, Zhigang Zhu
- Abstract summary: VerifierQ is a novel approach that integrates Offline Q-learning into verifier models.
We address three key challenges in applying Q-learning to LLMs.
Our method enables parallel Q-value computation and improves training efficiency.
- Score: 7.7705926659081275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates Offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a novel Conservative Q-learning (CQL) formulation for balanced Q-value estimation. Our method enables parallel Q-value computation and improves training efficiency. While recent work has explored RL techniques like MCTS for generators, VerifierQ is among the first to investigate the verifier (critic) aspect in LLMs through Q-learning. This integration of RL principles into verifier models complements existing advancements in generator techniques, potentially enabling more robust and adaptive reasoning in LLMs. Experimental results on mathematical reasoning tasks demonstrate VerifierQ's superior performance compared to traditional supervised fine-tuning approaches, with improvements in efficiency, accuracy, and robustness. By enhancing the synergy between generation and evaluation capabilities, VerifierQ contributes to the ongoing evolution of AI systems in addressing complex cognitive tasks across various domains.
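To make the three ingredients concrete, below is a minimal, hypothetical sketch of how a verifier's Q-head and training loss could combine a bounded Bellman target, IQL-style expectile regression, and a simplified CQL-flavoured penalty. Every name here (VerifierQHead, verifierq_loss, tau, cql_alpha) and the exact form of each term are illustrative assumptions; the paper's actual Bellman modification and CQL formulation may differ.

```python
# Minimal, hypothetical sketch of a VerifierQ-style objective (PyTorch).
# Names and the exact form of each term are assumptions for illustration,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VerifierQHead(nn.Module):
    """Maps LLM hidden states to per-step Q- and V-estimates bounded in [0, 1]."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, 1)  # Q(s, a) for the observed step
        self.v_proj = nn.Linear(hidden_dim, 1)  # state value V(s)

    def forward(self, hidden_states: torch.Tensor):
        # Sigmoid keeps values in [0, 1], matching a "probability that the
        # solution is eventually correct" reading of the verifier score.
        q = torch.sigmoid(self.q_proj(hidden_states)).squeeze(-1)
        v = torch.sigmoid(self.v_proj(hidden_states)).squeeze(-1)
        return q, v


def expectile_loss(diff: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """IQL-style asymmetric (expectile) regression with weight |tau - 1(diff < 0)|."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def verifierq_loss(q, v, reward, next_v, gamma=1.0, tau=0.9, cql_alpha=0.1):
    """Bounded Bellman regression + IQL expectile value fit + a simplified
    conservative (CQL-flavoured) regularizer against overestimation."""
    # (1) Bellman target clamped to [0, 1] so Q stays a valid correctness score.
    target = (reward + gamma * next_v).clamp(0.0, 1.0)
    bellman = F.mse_loss(q, target.detach())
    # (2) Fit V to an upper expectile of Q (Implicit Q-learning).
    value_fit = expectile_loss(q.detach() - v, tau)
    # (3) Stand-in conservative term that pushes mean Q down slightly.
    conservative = cql_alpha * q.mean()
    return bellman + value_fit + conservative
```

Because a single forward pass over the verifier LLM's hidden states yields a Q-estimate for every reasoning step at once, the per-step loss terms can be computed in parallel, which is consistent with the parallel Q-value computation the abstract describes.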
Related papers
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization [50.485788083202124]
Reinforcement Learning (RL) plays a crucial role in aligning large language models with human preferences and improving their ability to perform complex tasks.
We introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model.
Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement learning approach for aligning language models.
arXiv Detail & Related papers (2024-10-11T23:29:20Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Quantum Machine Learning Architecture Search via Deep Reinforcement Learning [8.546707309430593]
We introduce deep reinforcement learning to explore proficient QML model architectures tailored for supervised learning tasks.
Our methodology involves training an RL agent to devise policies that facilitate the discovery of QML models without a predetermined ansatz.
Our proposed method successfully identifies VQC architectures capable of achieving high classification accuracy while minimizing gate depth.
arXiv Detail & Related papers (2024-07-29T16:20:51Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Challenges for Reinforcement Learning in Quantum Circuit Design [8.894627352356302]
Hybrid quantum machine learning (QML) comprises both the application of quantum computing (QC) to improve machine learning (ML) and the application of ML to improve QC architectures.
We propose qcd-gym, a concrete framework formalized as a Markov decision process, to enable learning policies capable of controlling a universal set of continuously parameterized quantum gates.
arXiv Detail & Related papers (2023-12-18T16:41:30Z) - Efficient quantum recurrent reinforcement learning via quantum reservoir computing [3.6881738506505988]
Quantum reinforcement learning (QRL) has emerged as a framework to solve sequential decision-making tasks.
This work presents a novel approach that constructs QRL agents using QRNN-based quantum long short-term memory (QLSTM).
arXiv Detail & Related papers (2023-09-13T22:18:38Z) - Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present novel methods Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z) - Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning, maintaining a set of parallel models and estimating the Q-value based on a randomly selected network (a rough sketch of this cross-evaluation idea appears after this list).
arXiv Detail & Related papers (2020-09-29T04:58:17Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees (the basic CQL penalty is also sketched after this list).
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
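Two of the entries above, Cross Learning in Deep Q-Networks and Conservative Q-Learning, describe the overestimation-mitigation machinery that VerifierQ adapts to the verifier setting. The sketch below is a rough, hypothetical illustration of both ideas: a cross/double-Q-style target that bootstraps from a randomly chosen peer network, and the basic CQL log-sum-exp penalty. The helper names (make_q_net, cross_q_target, cql_penalty) and network sizes are assumptions, not code from either paper.

```python
# Hypothetical sketches: (a) a cross/double-Q-style target that bootstraps
# from a randomly chosen peer network, and (b) a basic CQL penalty.
import random
import torch
import torch.nn as nn


def make_q_net(obs_dim: int, n_actions: int) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))


def cross_q_target(q_nets, i, reward, next_obs, done, gamma=0.99):
    """Target for network i, evaluated with a different, randomly chosen network;
    decoupling selection from evaluation tempers overestimation bias."""
    j = random.choice([k for k in range(len(q_nets)) if k != i])
    with torch.no_grad():
        next_q = q_nets[j](next_obs).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q


def cql_penalty(q_all_actions, q_data_actions):
    """CQL-style regularizer: push down Q over all actions (log-sum-exp) while
    pushing up Q on actions actually observed in the offline data."""
    return torch.logsumexp(q_all_actions, dim=1).mean() - q_data_actions.mean()
```

Adding cql_penalty (scaled by a coefficient) to an ordinary TD loss gives the lower-bound behaviour described in the CQL entry; per the abstract, VerifierQ's contribution is a modified CQL formulation suited to bounded, utterance-level Q-values rather than this standard form.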
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.