Related papers: General learned delegation by clones

URL: http://arxiv.org/abs/2602.13262v1
Date: Tue, 03 Feb 2026 15:53:35 GMT
Title: General learned delegation by clones
Authors: Darren Li, Meiqi Chen, Chenze Shao, Fandong Meng, Jie Zhou
Abstract summary: Serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts.
Score: 55.144380092379976
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.

Related papers

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners [69.66089681814013]
$V_1$ is a framework that unifies generation and verification through efficient pairwise ranking. $V_1$-Infer improves Pass@1 by up to $10\%$ over pointwise verification. $V_1$-PairRL achieves $7$--$9\%$ test-time scaling gains over standard RL and pointwise joint training.
arXiv Detail & Related papers (2026-03-04T17:22:16Z)
CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs [31.371566320424552]
CoBA-RL is a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements.
arXiv Detail & Related papers (2026-02-03T03:14:36Z)
Towards regularized learning from functional data with covariate shift [3.072411352294816]
This paper investigates a general regularization framework for unsupervised domain adaptation in vector-valued regression. By restricting the hypothesis space, we develop a practical operator learning algorithm capable of handling functional outputs.
arXiv Detail & Related papers (2026-01-28T20:30:05Z)
Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose Coupled Variational Reinforcement Learning (CoVRL) to bridge variational inference and reinforcement learning. CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z)
ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models [99.6720868215076]
We introduce ThreadWeaver, a framework for adaptive parallel reasoning. ThreadWeaver achieves accuracy on par with popular sequential reasoning models of comparable size. We show that ThreadWeaver delivers up to 1.53x average speedup in token latency.
arXiv Detail & Related papers (2025-11-24T18:55:59Z)
Context Attribution with Multi-Armed Bandit Optimization [11.715006981206844]
We propose a novel framework that formulates context attribution as a combinatorial multi-armed bandit (CMAB) problem. We employ Combinatorial Thompson Sampling (CTS) to efficiently explore the exponentially large space of context subsets under a limited query budget. Our method defines a reward function based on normalized token likelihoods, capturing how well a subset of segments supports the original model response.
arXiv Detail & Related papers (2025-06-24T19:47:27Z)
DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling [20.605487145370752]
Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework.
arXiv Detail & Related papers (2025-06-19T05:40:54Z)
CC-LEARN: Cohort-based Consistency Learning [5.7716971260066]
Large language models struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn). Experiments show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines.
arXiv Detail & Related papers (2025-06-18T17:41:28Z)
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP). On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $\sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z)
Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
A Unified Framework for Multi-distribution Density Ratio Estimation [101.67420298343512]
Binary density ratio estimation (DRE) provides the foundation for many state-of-the-art machine learning algorithms. We develop a general framework from the perspective of Bregman divergence minimization. We show that our framework leads to methods that strictly generalize their counterparts in binary DRE.
arXiv Detail & Related papers (2021-12-07T01:23:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.