InterCode: Standardizing and Benchmarking Interactive Coding with
Execution Feedback
- URL: http://arxiv.org/abs/2306.14898v3
- Date: Mon, 30 Oct 2023 17:52:18 GMT
- Title: InterCode: Standardizing and Benchmarking Interactive Coding with
Execution Feedback
- Authors: John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao
- Abstract summary: We introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning environment.
Our framework is language and platform agnostic and uses self-contained Docker environments to provide safe and reproducible execution.
We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies.
- Score: 50.725076393314964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans write code in a fundamentally interactive manner and rely on constant
execution feedback to correct errors, resolve ambiguities, and decompose tasks.
While LLMs have recently exhibited promising coding capabilities, current
coding benchmarks mostly consider a static instruction-to-code sequence
transduction process, which has the potential for error propagation and a
disconnect between the generated code and its final execution environment. To
address this gap, we introduce InterCode, a lightweight, flexible, and
easy-to-use framework of interactive coding as a standard reinforcement
learning (RL) environment, with code as actions and execution feedback as
observations. Our framework is language and platform agnostic, uses
self-contained Docker environments to provide safe and reproducible execution,
and is compatible out-of-the-box with traditional seq2seq coding methods, while
enabling the development of new methods for interactive code generation. We use
InterCode to create three interactive code environments with Bash, SQL, and
Python as action spaces, leveraging data from the static NL2Bash, Spider, and
MBPP datasets. We demonstrate InterCode's viability as a testbed by evaluating
multiple state-of-the-art LLMs configured with different prompting strategies
such as ReAct and Plan & Solve. Our results showcase the benefits of
interactive code generation and demonstrate that InterCode can serve as a
challenging benchmark for advancing code understanding and generation
capabilities. InterCode is designed to be easily extensible and can even be
used to create new tasks such as Capture the Flag, a popular coding puzzle that
is inherently multi-step and involves multiple programming languages. Project
site with code and data: https://intercode-benchmark.github.io
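To make the RL framing concrete: the agent emits code as an action, the environment executes it inside a Docker container, and the execution output comes back as the next observation together with a reward. Below is a minimal sketch of what such an interaction loop could look like. It assumes a Gym-style interface; the import path, the `BashEnv` class name, its constructor arguments, and the `reset`/`step` signatures are illustrative assumptions rather than the package's confirmed API.

```python
# Hypothetical sketch of an interactive coding loop in the style described by
# the abstract: code as actions, execution feedback as observations.
# The import path, class name, and constructor arguments below are assumptions
# modeled on a Gym-style interface, not the confirmed InterCode API.
from intercode.envs import BashEnv  # assumed import path


def run_episode(agent, max_turns: int = 10) -> float:
    """Run one interactive episode: the agent proposes Bash commands,
    the environment executes them in a Docker container, and the
    execution output is fed back as the next observation."""
    env = BashEnv(image_name="intercode-bash")   # assumed constructor arguments
    obs = env.reset()                            # natural-language task + initial state
    reward, done = 0.0, False
    for _ in range(max_turns):
        action = agent.act(obs)                  # agent emits a code string (the action)
        obs, reward, done, info = env.step(action)  # execute code, observe its output
        if done:                                 # agent submits or episode terminates
            break
    env.close()
    return reward  # task-completion score assigned by the environment


# `agent` here is any policy exposing an `act(observation) -> code string` method,
# e.g. an LLM wrapper driven by a ReAct or Plan & Solve prompting strategy.
```

Because the loop only exchanges code strings and execution output, a traditional seq2seq model can be plugged in by emitting its full program as a single action, while multi-turn agents can interleave exploratory commands with a final submission.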
Related papers
- What can Large Language Models Capture about Code Functional Equivalence? (2024-08-20)
  We introduce SeqCoBench, a benchmark for assessing how Code-LLMs can capture code functional equivalence.
  We conduct evaluations on state-of-the-art (Code-)LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench.
- Executable Code Actions Elicit Better LLM Agents (2024-02-01)
  This work proposes using Python code to consolidate Large Language Model (LLM) agents' actions into a unified action space (CodeAct).
  Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions.
  The encouraging performance of CodeAct motivates building an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users in natural language.
- CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation (2023-11-14)
  Large Language Models (LLMs) have demonstrated remarkable performance in assisting humans with programming.
  Existing benchmarks for evaluating the code understanding and generation capacities of LLMs suffer from severe limitations.
  We introduce CodeScope, an execution-based, multilingual, multitask, multidimensional evaluation benchmark.
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion (2023-06-26)
  LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
  Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction, while memory tokens highlight important statements that may be invoked later and need to be memorized.
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation (2023-05-13)
  Large language models (LLMs) pretrained on vast amounts of source code have achieved prominent progress in code intelligence.
  CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
  CodeT5+ is extensively evaluated on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction tuning.
- ReACC: A Retrieval-Augmented Code Completion Framework (2022-03-15)
  We propose a retrieval-augmented code completion framework that leverages both lexical copying and reference to code with similar semantics via retrieval.
  We evaluate our approach on code completion in Python and Java, achieving state-of-the-art performance on the CodeXGLUE benchmark.
- CodeRetriever: Unimodal and Bimodal Contrastive Learning (2022-01-26)
  We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
  For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on documentation and function names.
  For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
- Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning (2020-02-24)
  Code summarization generates a brief natural language description for a given source code snippet, while code retrieval fetches relevant source code given a natural language query.
  Recent studies have combined these two tasks to improve their performance.
  We propose a novel end-to-end model for the two tasks by introducing an additional code generation task.
This list is automatically generated from the titles and abstracts of the papers on this site.