CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and Text
- URL: http://arxiv.org/abs/2403.01784v1
- Date: Mon, 4 Mar 2024 07:26:07 GMT
- Title: CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and Text
- Authors: Zhenru Lin, Yiqun Yao, Yang Yuan
- Abstract summary: Large language models (LLMs) such as ChatGPT are increasingly proficient in understanding and generating a mixture of code and text.
We present an automatic evaluation framework called $\textbf{CatCode}$ that can comprehensively assess the coding abilities of LLMs.
- Score: 11.872260531587692
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) such as ChatGPT are increasingly proficient in
understanding and generating a mixture of code and text. Evaluation based on
such a $\textit{mixture}$ can lead to a more comprehensive understanding of the
models' abilities in solving coding problems. However, in this context, current
evaluation methods are either limited in task coverage or lack standardization.
To address this issue, we propose using category theory as a framework for
evaluation. Specifically, morphisms within a code category can represent code
debugging and transformation, functors between two categories represent code
translation, and functors between a code category and a natural language
category represent code generation, explanation, and reproduction. We present
an automatic evaluation framework called $\textbf{CatCode}$
($\textbf{Cat}$egory $\textbf{Code}$) that can comprehensively assess the
coding abilities of LLMs, including ChatGPT, Text-Davinci, and CodeGeeX.
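
To make the mapping above concrete, here is a minimal Python sketch of how morphisms and functors could frame the evaluation tasks. The `Model` callable, class names, and prompt wordings are illustrative assumptions of this sketch, not CatCode's actual API.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # any LLM wrapper: prompt in, completion out

@dataclass(frozen=True)
class CodeObj:
    language: str
    source: str

@dataclass(frozen=True)
class NLObj:
    text: str

def transform(model: Model, x: CodeObj, instruction: str) -> CodeObj:
    """Morphism within one code category (e.g., debugging): code -> code."""
    return CodeObj(x.language, model(f"{instruction}\n{x.source}"))

def translate(model: Model, x: CodeObj, target: str) -> CodeObj:
    """Functor between two code categories: maps objects across languages."""
    return CodeObj(target, model(f"Translate this {x.language} code to {target}:\n{x.source}"))

def explain(model: Model, x: CodeObj) -> NLObj:
    """Functor from a code category to the natural-language category."""
    return NLObj(model(f"Explain what this {x.language} code does:\n{x.source}"))

def generate(model: Model, d: NLObj, language: str) -> CodeObj:
    """Functor from the natural-language category back to a code category."""
    return CodeObj(language, model(f"Write {language} code that: {d.text}"))

# Reproduction corresponds to the round trip generate(explain(x)):
# comparing the endpoint against x (e.g., by running both) scores the model.
```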
Related papers
- What can Large Language Models Capture about Code Functional Equivalence? [24.178831487657945]
We introduce SeqCoBench, a benchmark for assessing how Code-LLMs can capture code functional equivalence.
We conduct evaluations on state-of-the-art (Code-)LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench.
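
As one concrete illustration (not SeqCoBench's actual harness), functional equivalence of two Python snippets can be probed behaviorally by running both on shared inputs; the entry-point name `f` and the inputs below are assumptions of this sketch.

```python
def behaviorally_equivalent(src_a: str, src_b: str, inputs) -> bool:
    def run(src: str, arg):
        env: dict = {}
        exec(src, env)              # each snippet must define a function f
        return env["f"](arg)
    return all(run(src_a, x) == run(src_b, x) for x in inputs)

prog1 = "def f(n): return sum(range(n + 1))"
prog2 = "def f(n): return n * (n + 1) // 2"
print(behaviorally_equivalent(prog1, prog2, range(10)))  # True
```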
arXiv Detail & Related papers (2024-08-20T11:19:06Z) - Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models [28.295926947968574]
Large language models (LLMs) have brought a paradigm shift to the field of code generation.
We empirically analyze the differences in coding style between the code generated by Code LLMs and the code written by human developers.
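
A sketch of the kind of surface-level style comparison such an analysis involves, using Python's `ast` and `tokenize` modules; the specific features below are this sketch's assumptions, not the paper's metric set.

```python
import ast
import io
import tokenize

def style_features(source: str) -> dict:
    # Simple surface features that differ between generated and human code.
    tree = ast.parse(source)
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
    comments = sum(1 for t in toks if t.type == tokenize.COMMENT)
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return {
        "avg_identifier_len": sum(map(len, names)) / max(len(names), 1),
        "comments_per_line": comments / max(len(lines), 1),
        "avg_line_len": sum(map(len, lines)) / max(len(lines), 1),
    }
```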
arXiv Detail & Related papers (2024-06-29T14:56:11Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
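
For illustration, a minimal retrieval-augmented generation loop of the kind such a benchmark evaluates; `embed` and `model` stand in for any embedding function and LLM, and are assumptions of this sketch.

```python
import numpy as np

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    # Rank documents by dot product with the query embedding
    # (cosine similarity if `embed` returns unit-normalized vectors).
    q = embed(query)
    scores = [float(np.dot(q, embed(d))) for d in docs]
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def rag_generate(task: str, docs: list[str], embed, model) -> str:
    # Prepend the retrieved context to the generation prompt.
    context = "\n\n".join(retrieve(task, docs, embed))
    return model(f"Context:\n{context}\n\nTask: {task}\nCode:")
```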
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Enhancing Repository-Level Code Generation with Integrated Contextual Information [8.58692613099365]
CatCoder is a novel code generation framework designed for statically typed programming languages.
CatCoder enhances repository-level code generation by integrating relevant code and type context.
Results show that CatCoder outperforms the RepoCoder baseline by up to 17.35% in terms of pass@k score.
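
The pass@k score mentioned above is usually computed with the standard unbiased estimator from the HumanEval paper: given n samples per task, of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed stably."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: always passes
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Probability that at least 1 of 5 drawn samples passes,
# given 3 correct out of 20 generated.
print(pass_at_k(n=20, c=3, k=5))
```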
arXiv Detail & Related papers (2024-06-05T13:56:42Z)
- CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification.
We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations.
We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
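
Execution-based verification in miniature: run a generated candidate against unit tests and flag failures. This shows the general pattern only, not CodeHalu's detection algorithm; the `solve` entry point is an assumption of this sketch.

```python
def passes_tests(candidate: str, tests: list[tuple]) -> bool:
    env: dict = {}
    try:
        exec(candidate, env)                 # candidate must define `solve`
        return all(env["solve"](*args) == expected
                   for args, expected in tests)
    except Exception:                        # crash counts as failure
        return False

good = "def solve(a, b): return a + b"
bad  = "def solve(a, b): return a - b"      # plausible-looking but wrong
tests = [((1, 2), 3), ((0, 5), 5)]
print(passes_tests(good, tests), passes_tests(bad, tests))  # True False
```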
arXiv Detail & Related papers (2024-04-30T23:56:38Z)
- Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy.
Results indicate that MANGO significantly improves the code pass rate based on the strong baselines.
The logical comment decoding strategy is notably more robust than Chain-of-Thought prompting.
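
A rough sketch of comment-pivot decoding in this spirit: prompt the model to state each logical step as a comment before emitting the code for that step. The template wording below is this sketch's assumption, not the paper's.

```python
def comment_pivot_prompt(problem: str) -> str:
    # Steer the model to interleave logical comments with code.
    return (
        f"Problem: {problem}\n"
        "Solve it in Python. Before each logical step, write a comment\n"
        "explaining the step, then the code for that step.\n"
    )

def decode_with_comments(model, problem: str) -> str:
    # `model` is any prompt -> completion callable (an assumption here).
    return model(comment_pivot_prompt(problem))
```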
arXiv Detail & Related papers (2024-04-11T08:30:46Z)
- CodeQueries: A Dataset of Semantic Queries over Code [7.0864879068510005]
We contribute a labeled dataset, called CodeQueries, of semantic queries over Python code.
Compared to existing datasets, the queries in CodeQueries concern code semantics, the context is file-level, and the answers are code spans.
We evaluate a large language model (GPT3.5-Turbo) in zero-shot and few-shot settings on a subset of CodeQueries.
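
A minimal sketch of the few-shot setup described: k exemplars of (query, code, answer span) are prepended to the test instance. The field names are illustrative assumptions, not the dataset's schema.

```python
def few_shot_prompt(exemplars: list[dict], query: str, code: str) -> str:
    # Zero-shot is the k=0 case: an empty exemplar list.
    shots = "\n\n".join(
        f"Query: {e['query']}\nCode:\n{e['code']}\nAnswer span: {e['span']}"
        for e in exemplars
    )
    return f"{shots}\n\nQuery: {query}\nCode:\n{code}\nAnswer span:"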
arXiv Detail & Related papers (2022-09-17T17:09:30Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of code with similar semantics.
We evaluate our approach on the code completion task in Python and Java, achieving state-of-the-art performance on the CodeXGLUE benchmark.
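
To illustrate the lexical-plus-semantic idea (a sketch under assumed components, not ReACC's retriever): blend a token-overlap score with embedding similarity; the 0.5/0.5 weighting and `embed` function are assumptions here.

```python
import numpy as np

def lexical_score(a: str, b: str) -> float:
    # Jaccard overlap on whitespace tokens, a stand-in for lexical copying.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def hybrid_retrieve(context: str, corpus: list[str], embed, k: int = 1):
    q = embed(context)
    def score(doc: str) -> float:
        semantic = float(np.dot(q, embed(doc)))
        return 0.5 * lexical_score(context, doc) + 0.5 * semantic
    return sorted(corpus, key=score, reverse=True)[:k]
```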
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
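
A bimodal text-code objective of this kind is typically an InfoNCE-style contrastive loss; here is a standard PyTorch formulation (not CodeRetriever's released code) that pulls each code embedding toward its paired text and away from other texts in the batch.

```python
import torch
import torch.nn.functional as F

def info_nce(code_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    # Normalize so the dot product is cosine similarity.
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.T / temperature   # (batch, batch)
    labels = torch.arange(len(code_emb))           # diagonal = positive pairs
    return F.cross_entropy(logits, labels)
```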
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
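
The "where-the-value-comes-from" relation in miniature: for straight-line Python assignments, link each variable use to its most recent definition. Real data-flow graphs (as used in GraphCodeBERT's pre-training) cover far more; this only illustrates the relation being encoded.

```python
import ast

def value_comes_from(source: str) -> list[tuple[str, str]]:
    edges, last_def = [], {}
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign):
            # Each variable read in the RHS gets an edge to its last definition.
            for n in ast.walk(stmt.value):
                if isinstance(n, ast.Name) and n.id in last_def:
                    edges.append((n.id, last_def[n.id]))
            for t in stmt.targets:
                if isinstance(t, ast.Name):
                    last_def[t.id] = ast.unparse(stmt.value)
    return edges

print(value_comes_from("x = 1\ny = x + 2\nz = y * x"))
# [('x', '1'), ('y', 'x + 2'), ('x', '1')]
```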
arXiv Detail & Related papers (2020-09-17T15:25:56Z)