COMEX: A Tool for Generating Customized Source Code Representations
- URL: http://arxiv.org/abs/2307.04693v1
- Date: Mon, 10 Jul 2023 16:46:34 GMT
- Title: COMEX: A Tool for Generating Customized Source Code Representations
- Authors: Debeshee Das, Noble Saji Mathews, Alex Mathai, Srikanth Tamilselvam,
Kranthi Sedamaki, Sridhar Chimalakonda and Atul Kumar
- Abstract summary: COMEX is a framework that allows researchers and developers to create and combine multiple code-views.
It can analyze both method-level snippets and program-level snippets by using both intra-procedural and inter-procedural analysis.
It is built on tree-sitter, a widely used incremental parser that supports over 40 languages.
- Score: 7.151800146054561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning effective representations of source code is critical for any Machine
Learning for Software Engineering (ML4SE) system. Inspired by natural language
processing, large language models (LLMs) like Codex and CodeGen treat code as
generic sequences of text and are trained on huge corpora of code data,
achieving state-of-the-art performance on several software engineering (SE)
tasks. However, valid source code, unlike natural language, follows a strict
structure and pattern governed by the underlying grammar of the programming
language. Current LLMs do not exploit this property of the source code as they
treat code like a sequence of tokens and overlook key structural and semantic
properties of code that can be extracted from code-views like the Control Flow
Graph (CFG), Data Flow Graph (DFG), Abstract Syntax Tree (AST), etc.
Unfortunately, the process of generating and integrating code-views for every
programming language is cumbersome and time-consuming. To overcome this
barrier, we propose our tool COMEX - a framework that allows researchers and
developers to create and combine multiple code-views which can be used by
machine learning (ML) models for various SE tasks. Some salient features of our
tool are: (i) it works directly on source code (which need not be compilable),
(ii) it currently supports Java and C#, (iii) it can analyze both method-level
snippets and program-level snippets by using both intra-procedural and
inter-procedural analysis, and (iv) it is easily extendable to other languages
as it is built on tree-sitter - a widely used incremental parser that supports
over 40 languages. We believe this easy-to-use code-view generation and
customization tool will give impetus to research in source code representation
learning methods and ML4SE.
Tool: https://pypi.org/project/comex - GitHub:
https://github.com/IBM/tree-sitter-codeviews - Demo:
https://youtu.be/GER6U87FVbU
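To make the idea of a combined code-view concrete, the sketch below hand-builds a statement-level CFG and a naive DFG for a toy loop and merges them into a single graph with typed edges. It is a self-contained illustration of the kind of structure a code-view generator like COMEX emits, not the COMEX API itself; the statement list, def/use sets, and edge encoding are hypothetical simplifications (consult the PyPI package and GitHub repository above for the actual interface).

```python
# Illustrative sketch only: a hand-rolled, statement-level CFG + DFG for a toy
# snippet, merged into one multi-relation graph. This mimics the *kind* of
# combined code-view described in the abstract; it is NOT the COMEX API.

from dataclasses import dataclass, field


@dataclass
class Statement:
    node_id: int
    text: str
    defs: set = field(default_factory=set)   # variables written by this statement
    uses: set = field(default_factory=set)   # variables read by this statement


# Toy method body: accumulate array elements into s inside a loop, then return s.
stmts = [
    Statement(0, "int s = 0;",    defs={"s"}, uses=set()),
    Statement(1, "int i = 0;",    defs={"i"}, uses=set()),
    Statement(2, "i < a.length",  defs=set(), uses={"i", "a"}),
    Statement(3, "s = s + a[i];", defs={"s"}, uses={"s", "a", "i"}),
    Statement(4, "i = i + 1;",    defs={"i"}, uses={"i"}),
    Statement(5, "return s;",     defs=set(), uses={"s"}),
]

# Control-flow edges (successor pairs), written out by hand for the toy loop.
cfg_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 2), (2, 5)]

# Very naive def-use edges: connect each definition of a variable to every later
# statement that uses it, ignoring kills and loop back-edges for brevity.
dfg_edges = []
for d in stmts:
    for var in d.defs:
        for u in stmts:
            if u.node_id > d.node_id and var in u.uses:
                dfg_edges.append((d.node_id, u.node_id, var))

# Combined code-view: one node set, edges tagged with their relation type.
combined = {
    "nodes": {s.node_id: s.text for s in stmts},
    "edges": [(a, b, "CFG") for a, b in cfg_edges]
           + [(a, b, f"DFG:{v}") for a, b, v in dfg_edges],
}

for a, b, kind in combined["edges"]:
    print(f"{combined['nodes'][a]!r} --{kind}--> {combined['nodes'][b]!r}")
```

In practice, a tool such as COMEX derives the nodes, def/use information, and edges automatically from a tree-sitter parse of the source file rather than from hand-written statements as above.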
Related papers
- Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights [9.414198519543564]
We present codellm-devkit (hereafter, CLDK), an open-source library that significantly simplifies the process of performing program analysis.
CLDK offers developers an intuitive and user-friendly interface, making it incredibly easy to provide rich program analysis context to code LLMs.
arXiv Detail & Related papers (2024-10-16T20:05:59Z) - Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages [1.559169421643164]
Node-based programming languages are increasingly popular in media arts coding domains.
Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity.
The best strategy for code generation in visual node-based programming languages is still an open question.
arXiv Detail & Related papers (2024-09-01T22:11:23Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to bridge the gap between programming languages and natural language.
Various experiments and ablations are done on four datasets covering both the C++ and Python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z) - CodeLens: An Interactive Tool for Visualizing Code Representations [12.59741038895472]
Representing source code in a generic input format is crucial to automate software engineering tasks.
Visualizing code representations can further enable human experts to gain an intuitive insight into the code.
We introduce a tool, CodeLens, which provides a visual interaction environment that supports various representation methods.
arXiv Detail & Related papers (2023-07-27T14:46:09Z) - CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of code with similar semantics (a minimal illustrative sketch of this retrieve-then-complete idea appears after this list).
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.