Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning
- URL: http://arxiv.org/abs/2601.21894v1
- Date: Thu, 29 Jan 2026 15:54:40 GMT
- Title: Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning
- Authors: Lukas Twist, Shu Yang, Hanqi Yan, Jingzhi Gong, Di Wang, Helen Yannakoudakis, Jie M. Zhang,
- Abstract summary: Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain-of-thought-style intermediate reasoning.<n>Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal.<n>We study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi-step reasoning during fine-tuning.
- Score: 16.919028520729793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain-of-thought-style intermediate reasoning. Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal, leaving open the question of which properties of code actually contribute to improved reasoning. To address this gap, we study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi-step reasoning during fine-tuning. We examine two complementary settings: solution-driven complexity, where complexity varies across multiple solutions to the same problem, and problem-driven complexity, where complexity reflects variation in the underlying tasks. Using cyclomatic complexity and logical lines of code to construct controlled fine-tuning datasets, we evaluate a range of open-weight LLMs on diverse reasoning benchmarks. Our findings show that although code can improve reasoning, structural properties strongly determine its usefulness. In 83% of experiments, restricting fine-tuning data to a specific structural complexity range outperforms training on structurally diverse code, pointing to a data-centric path for improving reasoning beyond scaling.
Related papers
- CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis [14.328535883908176]
We present CodeGlance, a benchmark investigating code reasoning challenges across three realistic scenarios.<n>We find that unseen function reasoning poses significant challenges especially for smaller models.<n>We identify critical code complexity features that significantly impact code reasoning difficulty across scenarios.
arXiv Detail & Related papers (2026-02-15T02:46:51Z) - Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization [8.236500918322138]
We propose Complexity Out of Distribution (Complexity OoD) generalization as a framework to define and measure reasoning.<n>A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity exceeds that of all training examples.<n>We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack.
arXiv Detail & Related papers (2025-10-06T13:08:31Z) - Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [53.00384299879513]
In large language models (LLMs), code and reasoning reinforce each other.<n>Code provides verifiable execution paths, enforces logical decomposition, and enables runtime validation.<n>We identify key challenges and propose future research directions to strengthen this synergy.
arXiv Detail & Related papers (2025-02-26T18:55:42Z) - Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the ability of complex instruction-following of large language models (LLMs) has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z) - Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning [89.89857766491475]
We propose a curriculum-based logical-aware instruction tuning framework, named LACT.<n>Specifically, we augment the arbitrary first-order logical queries via binary tree decomposition.<n> Experiments across widely used datasets demonstrate that LACT has substantial improvements(brings an average +5.5% MRR score) over advanced methods, achieving the new state-of-the-art.
arXiv Detail & Related papers (2024-05-02T18:12:08Z) - Parrot Mind: Towards Explaining the Complex Task Reasoning of Pretrained Large Language Models with Template-Content Structure [66.33623392497599]
We show that a structure called template-content structure (T-C structure) can reduce the possible space from exponential level to linear level.
We demonstrate that models can achieve task composition, further reducing the space needed to learn from linear to logarithmic.
arXiv Detail & Related papers (2023-10-09T06:57:45Z) - When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z) - Query Structure Modeling for Inductive Logical Reasoning Over Knowledge
Graphs [67.043747188954]
We propose a structure-modeled textual encoding framework for inductive logical reasoning over KGs.
It encodes linearized query structures and entities using pre-trained language models to find answers.
We conduct experiments on two inductive logical reasoning datasets and three transductive datasets.
arXiv Detail & Related papers (2023-05-23T01:25:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.