TTCS: Test-Time Curriculum Synthesis for Self-Evolving
- URL: http://arxiv.org/abs/2601.22628v1
- Date: Fri, 30 Jan 2026 06:38:02 GMT
- Title: TTCS: Test-Time Curriculum Synthesis for Self-Evolving
- Authors: Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, Jinsong Su
- Abstract summary: Test-Time Training offers a promising way to improve the reasoning ability of large language models. We propose TTCS, a co-evolving test-time training framework. We show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks.
- Score: 47.826209735956716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co-evolving test-time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver's current capability, while the solver updates itself using self-consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver's feedback guides the synthesizer to generate questions aligned with the model's current capability, and the generated question variants in turn stabilize the solver's test-time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general-domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test-time curricula for self-evolving. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.
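The co-evolving loop the abstract describes is concrete enough to sketch. The following Python is a minimal illustration, not the authors' implementation: `generate_variant`, `sample_answer`, and the `update` calls are hypothetical stand-ins, and the actual reward shaping and policy updates in TTCS may differ. What it does show is the label-free self-consistency reward, where agreement among sampled answers substitutes for ground-truth correctness at test time.

```python
from collections import Counter

# Hypothetical helpers: `solver`/`synthesizer` stand in for the two policies
# TTCS initializes from the same pretrained model. Method names and the
# update rule below are illustrative, not the authors' implementation.

def self_consistency_rewards(answers: list[str]) -> list[float]:
    """Label-free reward: each sampled answer scores 1 if it matches the
    majority answer among the k samples, else 0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def ttcs_round(solver, synthesizer, test_questions, k=8):
    """One co-evolving round: synthesize question variants, then update the
    solver with self-consistency rewards on original + synthetic questions."""
    variants = [synthesizer.generate_variant(q) for q in test_questions]
    experience = []
    for q in test_questions + variants:
        answers = [solver.sample_answer(q) for _ in range(k)]
        experience.append((q, answers, self_consistency_rewards(answers)))
    solver.update(experience)        # e.g., a policy-gradient step
    synthesizer.update(experience)   # solver agreement signals difficulty,
                                     # steering synthesis toward questions
                                     # near the solver's current capability
    return experience
```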
Related papers
- HLTCOE Evaluation Team at TREC 2025: VQA Track [76.85337417923331]
The HLTCOE Evaluation team participated in the TREC VQA Answer Generation (AG) task. We developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation.
arXiv Detail & Related papers (2025-12-08T17:25:13Z)
- LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge [31.40589987269264]
We present LiveSearchBench, an automated pipeline for constructing retrieval-dependent benchmarks from recent knowledge updates. Our method computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three levels of reasoning difficulty. Experiments show a pronounced performance drop when models confront facts that post-date pretraining, with the gap most salient on multi-hop queries.
arXiv Detail & Related papers (2025-11-03T10:00:49Z)
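As a reading aid, the snapshot-delta idea above can be sketched in a few lines. This is an assumption-laden illustration (triples as plain tuples; `triple_delta` is not from the paper), and the filtering and question-synthesis stages are omitted.

```python
# Illustrative sketch of the snapshot-delta step only; the paper's quality
# filtering and three-level question synthesis are not reproduced. Triples
# are assumed to be plain (subject, predicate, object) tuples.

def triple_delta(old_snapshot: set, new_snapshot: set):
    """Facts added, removed, or changed between successive Wikidata
    snapshots become candidates for retrieval-dependent questions."""
    added = new_snapshot - old_snapshot
    removed = old_snapshot - new_snapshot
    # A changed fact shows up as a removed/added pair sharing (subject, predicate).
    changed_keys = {(s, p) for s, p, _ in added} & {(s, p) for s, p, _ in removed}
    return added, removed, changed_keys
```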
- Code-driven Number Sequence Calculation: Enhancing the Inductive Reasoning Abilities of Large Language Models [44.17697803306198]
We introduce CodeSeq, a synthetic post-training dataset built from number sequences. Our pipeline generates supervised fine-tuning data by reflecting on failed test cases and incorporating iterative corrections. Experimental results show that models trained with CodeSeq improve on various reasoning tasks while preserving their OOD performance.
arXiv Detail & Related papers (2025-10-16T12:29:40Z)
- Understanding the Role of Training Data in Test-Time Scaling [56.12341509545198]
We study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. We show that training on a diverse, relevant, and hard set of tasks results in the best performance for test-time scaling.
arXiv Detail & Related papers (2025-10-04T01:38:48Z)
- CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs [14.97707719362011]
We propose Cycle-Consistency in Question Answering (CCQA). Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks.
arXiv Detail & Related papers (2025-09-23T02:01:03Z)
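The CCQA selection rule above is simple enough to sketch. Both `generate_question` and `similarity` below are hypothetical stand-ins; the paper's similarity measure may differ.

```python
# Sketch of CCQA's cycle-consistency selection. `generate_question` and
# `similarity` are hypothetical stand-ins for the paper's components.

def ccqa_select(question, candidates, generate_question, similarity):
    """candidates: (reasoning_path, answer) pairs sampled from the model.
    Regenerate a question from each candidate and return the answer whose
    regenerated question is most similar to the original one."""
    best_answer, best_score = None, float("-inf")
    for path, answer in candidates:
        reconstructed = generate_question(path, answer)
        score = similarity(reconstructed, question)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```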
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new, superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
- SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling [39.57154199908565]
Self-Enhanced Test-Time Scaling (SETS) is a simple yet effective approach that overcomes the limitations of each technique by strategically combining parallel and sequential approaches. SETS exploits the inherent self-verification and self-correction capabilities of large language models, unifying sampling, verification, and correction within a single framework. Our results show that SETS achieves significant performance improvements and more favorable test-time scaling behavior than the alternatives.
arXiv Detail & Related papers (2025-01-31T17:03:16Z)
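A minimal sketch of the sampling/verification/correction loop attributed to SETS above; the `model.solve`, `model.verify`, and `model.correct` calls are hypothetical prompt-based operations rather than the paper's exact procedure.

```python
# Sketch of the sample -> verify -> correct loop attributed to SETS above.
# `model.solve/verify/correct` are hypothetical prompt-based operations;
# the paper's prompting and aggregation details may differ.

def sets_answer(model, question, n_samples=4, max_rounds=3):
    finals = []
    for _ in range(n_samples):                        # parallel sampling
        answer = model.solve(question)
        for _ in range(max_rounds):                   # sequential refinement
            if model.verify(question, answer):        # self-verification
                break
            answer = model.correct(question, answer)  # self-correction
        finals.append(answer)
    return max(set(finals), key=finals.count)         # majority vote
```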
- Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach to leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
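The testing-by-betting framework mentioned in this entry has a compact core, sketched below under simplifying assumptions (a fixed betting fraction and a given payoff function `g`; the paper instead learns `g` with a machine learning model).

```python
# Testing by betting, minimally: under the null the wealth process is a
# nonnegative supermartingale, so by Ville's inequality rejecting once
# wealth reaches 1/alpha gives an anytime-valid level-alpha test. Here the
# payoff g maps an observation into [-1, 1] and the betting fraction is
# fixed; the paper's learned-payoff variant is not reproduced.

def betting_test(observations, g, alpha=0.05, bet=0.5):
    wealth = 1.0
    for t, x in enumerate(observations, start=1):
        wealth *= 1.0 + bet * g(x)  # E[g(X)] = 0 under the null
        if wealth >= 1.0 / alpha:
            return t                # reject the null at time t
    return None                     # no rejection yet; monitoring can resume
```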
- QADYNAMICS: Training Dynamics-Driven Synthetic QA Diagnostic for Zero-Shot Commonsense Question Answering [48.25449258017601]
State-of-the-art approaches fine-tune language models on QA pairs constructed from CommonSense Knowledge Bases.
We propose QADYNAMICS, a training dynamics-driven framework for QA diagnostics and refinement.
arXiv Detail & Related papers (2023-10-17T14:27:34Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- Momentum Contrastive Pre-training for Question Answering [54.57078061878619]
MCROSS introduces a momentum contrastive learning framework to align the answer probability between cloze-like and natural query-passage sample pairs.
Our method achieves noticeable improvement compared with all baselines in both supervised and zero-shot scenarios.
arXiv Detail & Related papers (2022-12-12T08:28:22Z)