Is Compression Really Linear with Code Intelligence?
- URL: http://arxiv.org/abs/2505.11441v3
- Date: Wed, 04 Jun 2025 06:01:48 GMT
- Title: Is Compression Really Linear with Code Intelligence?
- Authors: Xianzhen Luo, Shijie Xuyang, Tianhao Cheng, Zheng Chu, Houyi Li, ziqi wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che,
- Abstract summary: textitFormat Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably.<n>Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC)<n>Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
- Score: 60.123628177110206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs' code intelligence, we introduce \textit{Format Annealing}, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve's tail under specific, limited conditions. Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
Related papers
- From Reasoning to Code: GRPO Optimization for Underrepresented Languages [0.7864304771129751]
This paper introduces a generalizable approach that uses small-scale code versions of the Qwen 2.5 model combined with Group Relative Policy Optimization.<n>It produces logically consistent and syntactically accurate code by directly integrating reasoning-driven feedback into the reinforcement learning loop.
arXiv Detail & Related papers (2025-05-20T11:28:48Z) - On Explaining (Large) Language Models For Code Using Global Code-Based Explanations [45.126233498200534]
Language Models for Code (LLM4Code) have significantly changed the landscape of software engineering (SE)<n>We introduce code rationales (Code$Q$), a technique with rigorous mathematical underpinning, to identify subsets of tokens that can explain individual code predictions.<n>Our evaluation demonstrates that Code$Q$ is a powerful interpretability method to explain how (less) meaningful input concepts (i.e., natural language particle at') highly impact output generation.
arXiv Detail & Related papers (2025-03-21T01:00:45Z) - SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors [5.247363735860479]
Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks.<n>Given LLMs' ability to understand and process diverse programs, they present a promising direction for building general-purpose surrogate models.<n>We introduce SURGE, a benchmark with $1160$ problems covering $8$ key aspects.<n>Through empirical analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy.
arXiv Detail & Related papers (2025-02-16T15:38:19Z) - Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights [0.0]
This paper introduces a novel scoring mechanism called the SBC score.<n>It is based on a reverse generation technique that leverages the natural language generation capabilities of Large Language Models.<n>Unlike direct code analysis, our approach reconstructs system requirements from AI-generated code and compares them with the original specifications.
arXiv Detail & Related papers (2025-02-11T01:12:11Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.<n>While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.<n>We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z) - Benchmarking and Explaining Large Language Model-based Code Generation:
A Causality-Centric Approach [12.214585409361126]
Large language models (LLMs)- based code generation is a complex and powerful black-box model.
We propose a novel causal graph-based representation of the prompt and the generated code.
We illustrate the insights that our framework can provide by studying over 3 popular LLMs with over 12 prompt adjustment strategies.
arXiv Detail & Related papers (2023-10-10T14:56:26Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that despite the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.