CodeFuse-Query: A Data-Centric Static Code Analysis System for
Large-Scale Organizations
- URL: http://arxiv.org/abs/2401.01571v1
- Date: Wed, 3 Jan 2024 06:56:39 GMT
- Title: CodeFuse-Query: A Data-Centric Static Code Analysis System for
Large-Scale Organizations
- Authors: Xiaoheng Xie, Gang Fan, Xiaojun Lin, Ang Zhou, Shijie Li, Xunjin
Zheng, Yinan Liang, Yu Zhang, Na Yu, Haokun Li, Xinyu Chen, Yingzhuang Chen,
Yi Zhen, Dejun Dong, Xianjin Fu, Jinzhou Su, Fuxiong Pan, Pengshuai Luo,
Youzheng Feng, Ruoxiang Hu, Jing Fan, Jinguo Zhou, Xiao Xiao, Peng Di
- Abstract summary: CodeFuse-Query reimagines code analysis as a data computation task.
The system supports scanning over 10 billion lines of code daily and more than 300 different tasks.
- Score: 21.688988418676878
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the domain of large-scale software development, the demands for dynamic
and multifaceted static code analysis exceed the capabilities of traditional
tools. To bridge this gap, we present CodeFuse-Query, a system that redefines
static code analysis through the fusion of Domain Optimized System Design and
Logic Oriented Computation Design.
CodeFuse-Query reimagines code analysis as a data computation task and supports
scanning over 10 billion lines of code daily and more than 300 different tasks.
It optimizes resource utilization, prioritizes data reusability, applies
incremental code extraction, and introduces task types designed specifically for
code changes, underscoring its domain-optimized design. The system's logic-oriented
facet employs Datalog, utilizing a unique two-tiered schema, COREF, to convert
source code into data facts. Through Gödel, a distinctive language,
CodeFuse-Query enables formulation of complex tasks as logical expressions,
harnessing Datalog's declarative prowess.
This paper provides empirical evidence of CodeFuse-Query's transformative
approach, demonstrating its robustness, scalability, and efficiency. We also
highlight its real-world impact and diverse applications, emphasizing its
potential to reshape the landscape of static code analysis in the context of
large-scale software development. Furthermore, in the spirit of collaboration
and advancing the field, our project is open-sourced and the repository is
available for public access.
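To make the "code analysis as data computation" idea in the abstract concrete, here is a minimal Python sketch. It is not CodeFuse-Query, the COREF schema, or Gödel syntax: the fact shape `(caller, callee)`, the helper names `extract_call_facts` and `transitive_closure`, and the sample source are all assumptions made for illustration. It extracts call facts from source text using the standard `ast` module, then evaluates a Datalog-style recursive rule by naive fixpoint iteration.

```python
import ast

# Hypothetical sample program to analyze (not taken from the paper).
SOURCE = '''
def parse(text):
    return tokenize(text)

def tokenize(text):
    return text.split()

def main():
    print(parse("a b"))
'''


def extract_call_facts(source):
    """Turn source code into a set of (caller, callee) facts.

    Only direct calls to plain names are recorded; method calls such as
    text.split() are skipped to keep the sketch small.
    """
    facts = set()
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    facts.add((func.name, node.func.id))
    return facts


def transitive_closure(calls):
    """Naive fixpoint evaluation of the Datalog-style rules:

        reach(x, y) :- calls(x, y).
        reach(x, z) :- reach(x, y), calls(y, z).
    """
    reach = set(calls)
    while True:
        derived = {(x, z) for (x, y) in reach for (y2, z) in calls if y == y2}
        if derived <= reach:  # nothing new was derived: fixpoint reached
            return reach
        reach |= derived


calls = extract_call_facts(SOURCE)
reach = transitive_closure(calls)

# The "analysis task" is now just a query over the derived facts.
for caller, callee in sorted(reach - calls):
    print(f"{caller} reaches {callee} only indirectly")
```

Once the facts are extracted, they can be stored and reused by many queries without re-parsing the code, which is the kind of data reuse the abstract refers to. A production system would use a real Datalog engine rather than this quadratic loop.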
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z)
- Chain-of-Programming (CoP) : Empowering Large Language Models for Geospatial Code Generation [2.6026969939746705]
This paper proposes a Chain of Programming framework to decompose the code generation process into five steps.
The framework incorporates a shared information pool, knowledge base retrieval, and user feedback mechanisms.
It significantly improves the logical clarity, syntactical correctness, and executability of the generated code.
arXiv Detail & Related papers (2024-11-16T09:20:35Z)
- Code-Survey: An LLM-Driven Methodology for Analyzing Large-Scale Codebases [3.8153349016958074]
We introduce Code-Survey, the first LLM-driven methodology designed to explore and analyze large-scale codebases.
By carefully designing surveys, Code-Survey transforms unstructured data, such as commits and emails, into organized, structured, and analyzable datasets.
This enables quantitative analysis of complex software evolution and uncovers valuable insights related to design, implementation, maintenance, reliability, and security.
arXiv Detail & Related papers (2024-09-24T17:08:29Z)
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows code-form plans -- pseudocode that outlines high-level, structured reasoning processes.
CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks.
It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z)
- LAMBDA: A Large Model Based Data Agent [7.240586338370509]
We introduce LArge Model Based Data Agent (LAMBDA), a novel open-source, code-free multi-agent data analysis system.
LAMBDA is designed to address data analysis challenges in complex data-driven applications.
It has the potential to enhance data analysis paradigms by seamlessly integrating human and artificial intelligence.
arXiv Detail & Related papers (2024-07-24T06:26:36Z)
- Enhancing Source Code Representations for Deep Learning with Static Analysis [10.222207222039048]
This paper explores the integration of static analysis and additional context such as bug reports and design patterns into source code representations for deep learning models.
We use the Abstract Syntax Tree-based Neural Network (ASTNN) method and augment it with additional context information obtained from bug reports and design patterns.
Our approach improves the representation and processing of source code, thereby improving task performance.
arXiv Detail & Related papers (2024-02-14T20:17:04Z)
- A Unified Active Learning Framework for Annotating Graph Data with Application to Software Source Code Performance Prediction [4.572330678291241]
We develop a unified active learning framework specializing in software performance prediction.
We investigate the impact of using different levels of information for active and passive learning.
Our approach aims to improve the investment in AI models for different software performance predictions.
arXiv Detail & Related papers (2023-04-06T14:00:48Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)