CodeComplex: A Time-Complexity Dataset for Bilingual Source Codes
- URL: http://arxiv.org/abs/2401.08719v1
- Date: Tue, 16 Jan 2024 06:54:44 GMT
- Title: CodeComplex: A Time-Complexity Dataset for Bilingual Source Codes
- Authors: Seung-Yeop Baik, Mingi Jeon, Joonghyuk Hahn, Jungin Kim, Yo-Sub Han,
Sang-Ki Ko
- Abstract summary: We introduce CodeComplex, a novel source code dataset where each code is manually annotated with a corresponding worst-case time complexity.
To the best of our knowledge, CodeComplex stands as the most extensive code dataset tailored for predicting complexity.
We present the outcomes of our experiments employing various baseline models, leveraging state-of-the-art neural models in code comprehension.
- Score: 6.169110187130671
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Analyzing the worst-case time complexity of a code is a crucial task in
computer science and software engineering for ensuring the efficiency,
reliability, and robustness of software systems. However, it is well known that
determining the worst-case time complexity of a given code written in a
general-purpose programming language is undecidable, a consequence of the
Halting problem proven by Alan Turing. Thus, we move towards more realistic
scenarios where the inputs and outputs of a program are available. This allows
us to assess the correctness of given codes, even though analyzing their time
complexity exhaustively remains challenging. In response to this challenge, we introduce
CodeComplex, a novel source code dataset where each code is manually annotated
with a corresponding worst-case time complexity. CodeComplex comprises 4,900
Java codes and an equivalent number of Python codes, all sourced from
programming competitions and annotated with complexity labels by a panel of
algorithmic experts. To the best of our knowledge, CodeComplex stands as the
most extensive code dataset tailored for predicting complexity. Subsequently,
we present the outcomes of our experiments employing various baseline models,
leveraging state-of-the-art neural models in code comprehension like CodeBERT,
GraphCodeBERT, UniXcoder, PLBART, CodeT5, CodeT5+, and ChatGPT. We analyze how
the dataset impacts the models' learning in predicting time complexity.
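To make the prediction task concrete, here is a toy heuristic baseline, not part of the paper and far weaker than the neural models above, that guesses a complexity class from the maximum loop-nesting depth of a Python snippet. The label set shown is a hypothetical subset of CodeComplex's complexity classes:

```python
import ast

# Hypothetical subset of CodeComplex-style complexity labels, keyed by the
# maximum loop-nesting depth found in a snippet (a deliberately naive proxy
# that ignores recursion, early exits, and loop bounds).
DEPTH_TO_CLASS = {0: "O(1)", 1: "O(n)", 2: "O(n^2)", 3: "O(n^3)"}

def max_loop_depth(node, depth=0):
    """Return the deepest for/while nesting level inside an AST node."""
    best = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, (ast.For, ast.While)) else depth
        best = max(best, max_loop_depth(child, child_depth))
    return best

def guess_complexity(source: str) -> str:
    """Map a snippet's loop-nesting depth to a complexity label."""
    depth = max_loop_depth(ast.parse(source))
    return DEPTH_TO_CLASS.get(depth, f"O(n^{depth})")

snippet = """
for i in range(n):
    for j in range(n):
        total += a[i][j]
"""
print(guess_complexity(snippet))  # prints O(n^2)
```

A baseline this crude is exactly what a labeled dataset like CodeComplex lets one measure neural models against.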
Related papers
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- MapCoder: Multi-Agent Code Generation for Competitive Problem Solving [3.3856216159724983]
We introduce a new approach to code generation tasks leveraging multi-agent prompting.
Our framework, MapCoder, consists of four LLM agents specifically designed to emulate the stages of program synthesis.
Our method consistently delivers superior performance across various programming languages.
arXiv Detail & Related papers (2024-05-18T22:10:15Z)
- CoCoST: Automatic Complex Code Generation with Online Searching and Correctness Testing [51.00909683314142]
Large Language Models have revolutionized code generation ability by converting natural language descriptions into executable code.
CoCoST framework enhances complex code generation by online searching for more information with planned queries and correctness testing for code refinement.
CoCoST is validated through rigorous experiments on the DS-1000 and ClassEval datasets.
arXiv Detail & Related papers (2024-03-20T13:33:55Z)
- Automatizing Software Cognitive Complexity Reduction through Integer Linear Programming [1.1970409518725493]
Recently, we modeled software cognitive complexity reduction as an optimization problem and proposed an approach to assist developers with this task.
This approach enumerates sequences of code extraction operations until a stopping criterion is met. As a result, it returns the minimal sequence of code extraction operations that is able to reduce the cognitive complexity of a code to the given threshold.
arXiv Detail & Related papers (2024-02-08T10:53:00Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
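As a rough illustration of AST-based structural scoring, one could weight control-flow nodes found in a Python AST. This is a simplified stand-in, not CIRS's actual formula; the weights below are invented for the sketch:

```python
import ast

# Hypothetical weights for control-flow constructs; CIRS's real definition of
# logical complexity differs, this only sketches the AST-walking idea.
WEIGHTS = {ast.If: 1, ast.BoolOp: 1, ast.Try: 1, ast.For: 2, ast.While: 2}

def logical_complexity(source: str) -> int:
    """Sum the weights of every control-flow node in the parsed AST."""
    return sum(WEIGHTS.get(type(node), 0) for node in ast.walk(ast.parse(source)))

code = """
if x > 0 and x < 10:
    for i in range(x):
        print(i)
"""
print(logical_complexity(code))  # prints 4 (If + BoolOp + For)
```

The score only inspects the syntax tree, so the snippet being scored never has to run.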
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
- TASTY: A Transformer based Approach to Space and Time complexity [0.4724825031148411]
Code-based Language Models (LMs) have shown very promising results in the field of software engineering.
We create a labelled dataset of code snippets spanning multiple languages.
We propose to use LMs to find space complexities from code, and to the best of our knowledge, this is the first attempt to do so.
arXiv Detail & Related papers (2023-05-06T03:37:44Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Competition-Level Code Generation with AlphaCode [74.87216298566942]
We introduce AlphaCode, a system for code generation that can create novel solutions to problems that require deeper reasoning.
In simulated evaluations on recent programming competitions on the Codeforces platform, AlphaCode achieved on average a ranking of top 54.3%.
arXiv Detail & Related papers (2022-02-08T23:16:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.