SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization
- URL: http://arxiv.org/abs/2602.04811v1
- Date: Wed, 04 Feb 2026 17:58:32 GMT
- Title: SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization
- Authors: Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu, Maosong Sun
- Abstract summary: We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL.
- Score: 52.635237306338574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where ``new'' knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
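The benchmark's core construction is to rename a known library's API into a pseudo-novel one so that pre-training knowledge cannot help. A minimal sketch of that idea is below; the function name `obfuscate_identifiers` and the specific renaming scheme (seeded random lowercase strings) are illustrative assumptions, not SE-Bench's actual obfuscation code.

```python
import random
import string

def obfuscate_identifiers(identifiers, seed=0):
    """Map each original API name to a distinct random pseudo-name.

    A minimal sketch of the paper's idea of turning a familiar library
    into a pseudo-novel package via randomized identifiers; SE-Bench's
    actual scheme may differ.
    """
    rng = random.Random(seed)  # seeded so the obfuscation is reproducible
    mapping = {}
    for name in identifiers:
        # Draw random lowercase names, retrying on the rare collision.
        new_name = "".join(rng.choices(string.ascii_lowercase, k=max(6, len(name))))
        while new_name in mapping.values():
            new_name = "".join(rng.choices(string.ascii_lowercase, k=max(6, len(name))))
        mapping[name] = new_name
    return mapping

# Hypothetical slice of a NumPy-like API surface.
api = ["array", "zeros", "reshape", "linspace"]
mapping = obfuscate_identifiers(api, seed=42)
```

Applying such a mapping to both the library source and its documentation yields tasks that are trivial with the renamed API doc in context, but unanswerable from pre-training alone.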
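The "RL Gap" finding attributes incomplete internalization partly to PPO clipping and negative gradients. As background, here is a sketch of the standard PPO clipped surrogate for a single sample; this is generic PPO, not the paper's training code, and the scalar formulation is a simplification of the usual batched, autodiff version.

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    When the probability ratio r leaves the clip interval on the side
    favored by the advantage, the objective flattens and the gradient
    vanishes -- one mechanism the paper cites for RL failing to fully
    internalize new knowledge.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: a large ratio (e ~ 2.718) is capped at 1 + eps = 1.2.
capped = ppo_clipped_term(1.0, 0.0, 1.0)
# Negative advantage: the min() keeps the unclipped term, so the
# penalty is not capped -- the asymmetry behind "negative gradients".
uncapped = ppo_clipped_term(1.0, 0.0, -1.0)
```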
Related papers
- BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation [16.147318846582298]
Simulating student learning behaviors in open-ended problem-solving environments holds potential for education research. However, collecting authentic data is challenging due to privacy concerns and the high cost of longitudinal studies. We present BEAGLE, a neuro-symbolic framework that addresses this bias by incorporating Self-Regulated Learning (SRL) theory into a novel architecture.
arXiv Detail & Related papers (2026-02-06T08:05:15Z)
- Toward Training Superintelligent Software Agents through Self-Play SWE-RL [66.11447353341926]
Self-play SWE-RL is a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories.
arXiv Detail & Related papers (2025-12-21T00:49:40Z)
- AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library [47.82769337589924]
AlphaOPT is a self-improving experience library for optimization modeling. It learns efficiently from limited demonstrations without rationales. It expands continually without costly retraining by updating the library rather than model weights.
arXiv Detail & Related papers (2025-10-21T09:03:26Z)
- RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs? [92.4931695205957]
We introduce DELTA-Code, a benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability and transferrability. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop.
arXiv Detail & Related papers (2025-09-25T11:20:56Z)
- Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation [89.65955788873532]
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP). We propose a novel framework named GenKI, which aims to improve OpenQA performance by exploring Knowledge Integration and controllable Generation.
arXiv Detail & Related papers (2025-05-26T08:18:33Z)
- R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning [83.256752220849]
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. We introduce R1-Searcher++, a framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval.
arXiv Detail & Related papers (2025-05-22T17:58:26Z)
- Know Or Not: a library for evaluating out-of-knowledge base robustness [0.0]
We present a novel methodology for systematically evaluating out-of-knowledge base (OOKB) robustness of large language models (LLMs). We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation data and pipelines for OOKB robustness.
arXiv Detail & Related papers (2025-05-19T03:17:41Z)
- Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data [19.168022702075774]
Open-world continual learning (OWCL) adapts to sequential tasks with open samples, learning knowledge incrementally while preventing forgetting. We propose a novel OFCL framework that integrates three key components: (1) an instance-wise token augmentation (ITA) that represents and enriches sample representations with additional knowledge, (2) a margin-based open boundary (MOB) that supports open detection with new tasks, and (3) an adaptive knowledge space (AKS) that endows unknowns with knowledge for the updating from unknowns to knowns.
arXiv Detail & Related papers (2025-02-28T11:39:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.