CWM: An Open-Weights LLM for Research on Code Generation with World Models
- URL: http://arxiv.org/abs/2510.02387v1
- Date: Tue, 30 Sep 2025 21:47:10 GMT
- Title: CWM: An Open-Weights LLM for Research on Code Generation with World Models
- Authors: FAIR CodeGen team, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabian Gloeckle, Alex Gu, Michael Hassid, Daniel Haziza, Badr Youbi Idrissi, Christian Keller, Rahul Kindi, Hugh Leather, Gallil Maimon, Aram Markosyan, Francisco Massa, Pierre-Emmanuel Mazaré, Vegard Mella, Naila Murray, Keyur Muzumdar, Peter O'Hearn, Matteo Pagliardini, Dmitrii Pedchenko, Tal Remez, Volker Seeker, Marco Selvi, Oren Sultan, Sida Wang, Luca Wehrstedt, Ori Yoran, Lingming Zhang, Taco Cohen, Yossi Adi, Gabriel Synnaeve,
- Abstract summary: We release Code World Model (CWM) to advance research on code generation with world models.<n>We mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments.<n>We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit the latter.
- Score: 78.0342683953353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.
Related papers
- From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code.<n>We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs.<n>We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder)
arXiv Detail & Related papers (2025-11-23T17:09:34Z) - Program Semantic Inequivalence Game with Large Language Models [20.43560028315856]
Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics.<n>In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ.<n>We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources.
arXiv Detail & Related papers (2025-05-02T20:03:35Z) - Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo [90.78001821963008]
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints.<n>We develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC)<n>Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language.
arXiv Detail & Related papers (2025-04-17T17:49:40Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.<n>While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.<n>We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors.
We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder.
We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
arXiv Detail & Related papers (2024-10-09T08:23:22Z) - In-Context Code-Text Learning for Bimodal Software Engineering [26.0027882745058]
Bimodal software analysis initially appeared to be within reach with the advent of large language models.
We postulate that in-context learning for the code-text bimodality is a promising avenue.
We consider a diverse dataset encompassing 23 software engineering tasks, which we transform in an in-context learning format.
arXiv Detail & Related papers (2024-10-08T19:42:00Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.<n>Existing direct preference learning algorithms are originally designed for the single-turn chat task.<n>We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - PWM: Policy Learning with Multi-Task World Models [37.678858748473196]
World model methods offer scalability by learning a simulation of the environment.<n> gradient-based methods exhibit lower variance but fail to handle discontinuities.<n>We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control.
arXiv Detail & Related papers (2024-07-02T17:47:03Z) - Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [5.913758275518443]
We consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL)
Calling code instead of LLMs for planning has potential to be more precise, reliable, interpretable, and extremely efficient.
We show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed.
arXiv Detail & Related papers (2024-05-24T09:31:26Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge.
It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology [4.2990995991059275]
Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) have transformed the field of Software Engineering.
We introduce CodePori, a novel system designed to automate code generation for large and complex software projects.
Results: CodePori is able to generate running code for large-scale projects, aligned with the typical software development process.
arXiv Detail & Related papers (2024-02-02T13:42:50Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.