CTTS: Collective Test-Time Scaling
- URL: http://arxiv.org/abs/2508.03333v2
- Date: Sun, 28 Sep 2025 04:43:05 GMT
- Title: CTTS: Collective Test-Time Scaling
- Authors: Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Lei Bai, Tao Chen, Wanli Ouyang,
- Abstract summary: Test-time scaling (TTS) has emerged as a promising, training-free approach for enhancing large language model (LLM) performance.<n>We introduce Collective Test-Time Scaling (CTTS) to overcome the dominant single test-time scaling (STTS) paradigm.<n>CTTS-MM is a novel framework that operationalizes multi-agent and multi-reward collaboration.
- Score: 58.564620942591866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling (TTS) has emerged as a promising, training-free approach for enhancing large language model (LLM) performance. However, the efficacy of existing methods, such as Best-of-N and Self-Consistency, is fundamentally constrained by the dominant single test-time scaling (STTS) paradigm, which relies on a single LLM agent interacting with a single reward model (SA-SR). Inspired by recent work showing that collective methods can surpass the performance ceiling of individual models, we introduce Collective Test-Time Scaling (CTTS). First, we systematically investigate three primary interaction paradigms of existing multiple models: single-agent-multi-reward (SA-MR), multi-agent-single-reward (MA-SR), and multi-agent-multi-reward (MA-MR). Extensive experiments reveal that the MA-MR paradigm is consistently superior. Based on this finding, we further propose CTTS-MM, a novel framework that operationalizes multi-agent and multi-reward collaboration. CTTS-MM integrates two key technical contributions: (1) for agent collaboration, an Agent Collaboration Search (ACS) that identifies the most effective combination of LLMs from a candidate pool; and (2) for reward model collaboration, a Mixture of Reward Models (MoR) strategy that leverages a Prior Reward model Ensemble Selection (PRES) algorithm to select the optimal ensemble. Evaluations across seven mainstream benchmarks demonstrate that CTTS-MM significantly outperforms leading STTS methods (+4.82% over Best-of-N) and surpasses even flagship proprietary LLMs (+7.06% over GPT-4.1) and open-source LLMs. These results highlight the substantial potential of collective scaling to push the frontier of LLM inference. Code will be released at https://github.com/magent4aci/CTTS-MM.
Related papers
- MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation [64.2621682259008]
Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS2)<n>We propose a Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS2) to integrate policy learning with multi-agent tree search.<n>We show that MARTI-MARS2 achieves 77.7%, outperforming strong baselines like GPT-5.1 on challenging code generation benchmarks.
arXiv Detail & Related papers (2026-02-08T07:28:44Z) - A Systematic Study of Model Merging Techniques in Large Language Models [43.5967188676583]
Model merging combines multiple fine-tuned checkpoints into a single model without additional training.<n>We present a large-scale, systematic evaluation of six state-of-the-art merging methods.<n>Results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs.
arXiv Detail & Related papers (2025-11-26T14:28:11Z) - Slm-mux: Orchestrating small language models for reasoning [52.461958665375896]
We propose a three-stage approach for orchestrating small language models (SLMs)<n>First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs.<n>With just two SLMS, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH.
arXiv Detail & Related papers (2025-10-06T17:49:58Z) - Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs.<n>Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks.<n>We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning [29.580108004844856]
Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks.<n>Recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks.<n>We introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination.
arXiv Detail & Related papers (2025-04-14T00:27:45Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language models.<n>Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.<n>We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark [72.46357004059661]
Generalist Virtual Agents (GVAs) have shown significant promise in autonomous task execution.<n>To address these challenges, we propose Similar, a Step-Wise Multi-Dimensional Generalist Reward Model.<n>Similar offers fine-grained signals for agent training and can choose better action for inference-time scaling.
arXiv Detail & Related papers (2025-03-24T13:30:47Z) - VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs)<n>Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.<n>For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z) - Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration [81.45763823762682]
This work aims to bridge the gap by investigating the problem of data synthesis through multi-agent sampling.<n>We introduce Tree Search-based Orchestrated Agents(TOA), where the workflow evolves iteratively during the sequential sampling process.<n>Our experiments on alignment, machine translation, and mathematical reasoning demonstrate that multi-agent sampling significantly outperforms single-agent sampling as inference compute scales.
arXiv Detail & Related papers (2024-12-22T15:16:44Z) - Efficient Multi-agent Reinforcement Learning by Planning [33.51282615335009]
Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks.
Most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios.
We propose the MAZero algorithm, which combines a centralized model with Monte Carlo Tree Search (MCTS) for policy search.
arXiv Detail & Related papers (2024-05-20T04:36:02Z) - A Framework to Implement 1+N Multi-task Fine-tuning Pattern in LLMs
Using the CGC-LORA Algorithm [7.521690071464451]
We propose a unified framework that implements a 1 + N mutli-task fine-tuning pattern in large language models (LLMs)
Our work aims to take an advantage of both MTL (i.e., CGC) and PEFT (i.e., LoRA) scheme.
arXiv Detail & Related papers (2024-01-22T07:58:31Z) - Towards Robust Multi-Modal Reasoning via Model Selection [7.6621866737827045]
LLM serves as the "brain" of the agent, orchestrating multiple tools for collaborative multi-step task solving.
We propose the $textitM3$ framework as a plug-in with negligible runtime overhead at test-time.
Our experiments reveal that our framework enables dynamic model selection, considering both user inputs and subtask dependencies.
arXiv Detail & Related papers (2023-10-12T16:06:18Z) - Multipath agents for modular multitask ML systems [2.579908688646812]
The presented work introduces a novel methodology allowing to define multiple methods as distinct agents.
Agents can collaborate and compete to generate and improve ML models for a given tasks.
arXiv Detail & Related papers (2023-02-06T11:57:45Z) - Model ensemble instead of prompt fusion: a sample-specific knowledge
transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM)
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.