CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and Text
- URL: http://arxiv.org/abs/2403.01784v1
- Date: Mon, 4 Mar 2024 07:26:07 GMT
- Title: CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and Text
- Authors: Zhenru Lin, Yiqun Yao, Yang Yuan
- Abstract summary: Large language models (LLMs) such as ChatGPT are increasingly proficient in understanding and generating a mixture of code and text.
We present an automatic evaluation framework called $\textbf{CatCode}$ that can comprehensively assess the coding abilities of LLMs.
- Score: 11.872260531587692
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) such as ChatGPT are increasingly proficient in
understanding and generating a mixture of code and text. Evaluation based on
such a $\textit{mixture}$ can lead to a more comprehensive understanding of the
models' abilities in solving coding problems. However, in this context, current
evaluation methods are either limited in task coverage or lack standardization.
To address this issue, we propose using category theory as a framework for
evaluation. Specifically, morphisms within a code category can represent code
debugging and transformation, functors between two categories represent code
translation, and functors between a code category and a natural language
category represent code generation, explanation, and reproduction. We present
an automatic evaluation framework called $\textbf{CatCode}$
($\textbf{Cat}$egory $\textbf{Code}$) that can comprehensively assess the
coding abilities of LLMs, including ChatGPT, Text-Davinci, and CodeGeeX.
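
To make the mapping above concrete, here is a minimal Python sketch of how morphisms and functors could frame the evaluation tasks. The `Model` callable, class names, and prompt wordings are illustrative assumptions of this sketch, not CatCode's actual API.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # any LLM wrapper: prompt in, completion out

@dataclass(frozen=True)
class CodeObj:
    language: str
    source: str

@dataclass(frozen=True)
class NLObj:
    text: str

def transform(model: Model, x: CodeObj, instruction: str) -> CodeObj:
    """Morphism within one code category (e.g., debugging): code -> code."""
    return CodeObj(x.language, model(f"{instruction}\n{x.source}"))

def translate(model: Model, x: CodeObj, target: str) -> CodeObj:
    """Functor between two code categories: maps objects across languages."""
    return CodeObj(target, model(f"Translate this {x.language} code to {target}:\n{x.source}"))

def explain(model: Model, x: CodeObj) -> NLObj:
    """Functor from a code category to the natural-language category."""
    return NLObj(model(f"Explain what this {x.language} code does:\n{x.source}"))

def generate(model: Model, d: NLObj, language: str) -> CodeObj:
    """Functor from the natural-language category back to a code category."""
    return CodeObj(language, model(f"Write {language} code that: {d.text}"))

# Reproduction corresponds to the round trip generate(explain(x)):
# comparing the endpoint against x (e.g., by running both) scores the model.
```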
Related papers
- What can Large Language Models Capture about Code Functional Equivalence? [24.178831487657945]
We introduce SeqCoBench, a benchmark for assessing how Code-LLMs can capture code functional equivalence.
We conduct evaluations on state-of-the-art (Code-)LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench.
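
As one concrete illustration (not SeqCoBench's actual harness), functional equivalence of two Python snippets can be probed behaviorally by running both on shared inputs; the entry-point name `f` and the inputs below are assumptions of this sketch.

```python
def behaviorally_equivalent(src_a: str, src_b: str, inputs) -> bool:
    def run(src: str, arg):
        env: dict = {}
        exec(src, env)              # each snippet must define a function f
        return env["f"](arg)
    return all(run(src_a, x) == run(src_b, x) for x in inputs)

prog1 = "def f(n): return sum(range(n + 1))"
prog2 = "def f(n): return n * (n + 1) // 2"
print(behaviorally_equivalent(prog1, prog2, range(10)))  # True
```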
arXiv Detail & Related papers (2024-08-20T11:19:06Z) - Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models [28.295926947968574]
Large language models (LLMs) have brought a paradigm shift to the field of code generation.
We empirically analyze the differences in coding style between the code generated by Code LLMs and the code written by human developers.
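
A sketch of the kind of surface-level style comparison such an analysis involves, using Python's `ast` and `tokenize` modules; the specific features below are this sketch's assumptions, not the paper's metric set.

```python
import ast
import io
import tokenize

def style_features(source: str) -> dict:
    # Simple surface features that differ between generated and human code.
    tree = ast.parse(source)
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
    comments = sum(1 for t in toks if t.type == tokenize.COMMENT)
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return {
        "avg_identifier_len": sum(map(len, names)) / max(len(names), 1),
        "comments_per_line": comments / max(len(lines), 1),
        "avg_line_len": sum(map(len, lines)) / max(len(lines), 1),
    }
```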
arXiv Detail & Related papers (2024-06-29T14:56:11Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
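
For illustration, a minimal retrieval-augmented generation loop of the kind such a benchmark evaluates; `embed` and `model` stand in for any embedding function and LLM, and are assumptions of this sketch.

```python
import numpy as np

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    # Rank documents by dot product with the query embedding
    # (cosine similarity if `embed` returns unit-normalized vectors).
    q = embed(query)
    scores = [float(np.dot(q, embed(d))) for d in docs]
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def rag_generate(task: str, docs: list[str], embed, model) -> str:
    # Prepend the retrieved context to the generation prompt.
    context = "\n\n".join(retrieve(task, docs, embed))
    return model(f"Context:\n{context}\n\nTask: {task}\nCode:")
```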
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Enhancing Repository-Level Code Generation with Integrated Contextual Information [8.58692613099365]
CatCoder is a novel code generation framework designed for statically typed programming languages.
CatCoder enhances repository-level code generation by integrating relevant code and type context.
Results show that CatCoder outperforms the RepoCoder baseline by up to 17.35% in terms of pass@k score.
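
The pass@k score mentioned above is usually computed with the standard unbiased estimator from the HumanEval paper: given n samples per task, of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed stably."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: always passes
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Probability that at least 1 of 5 drawn samples passes,
# given 3 correct out of 20 generated.
print(pass_at_k(n=20, c=3, k=5))
```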
arXiv Detail & Related papers (2024-06-05T13:56:42Z)
- CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification.
We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations.
We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
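
Execution-based verification in miniature: run a generated candidate against unit tests and flag failures. This shows the general pattern only, not CodeHalu's detection algorithm; the `solve` entry point is an assumption of this sketch.

```python
def passes_tests(candidate: str, tests: list[tuple]) -> bool:
    env: dict = {}
    try:
        exec(candidate, env)                 # candidate must define `solve`
        return all(env["solve"](*args) == expected
                   for args, expected in tests)
    except Exception:                        # crash counts as failure
        return False

good = "def solve(a, b): return a + b"
bad  = "def solve(a, b): return a - b"      # plausible-looking but wrong
tests = [((1, 2), 3), ((0, 5), 5)]
print(passes_tests(good, tests), passes_tests(bad, tests))  # True False
```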
arXiv Detail & Related papers (2024-04-30T23:56:38Z)
- Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy.
Results indicate that MANGO significantly improves the code pass rate based on the strong baselines.
The logical comment decoding strategy is notably more robust than Chain-of-Thought prompting.
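
A rough sketch of comment-pivot decoding in this spirit: prompt the model to state each logical step as a comment before emitting the code for that step. The template wording below is this sketch's assumption, not the paper's.

```python
def comment_pivot_prompt(problem: str) -> str:
    # Steer the model to interleave logical comments with code.
    return (
        f"Problem: {problem}\n"
        "Solve it in Python. Before each logical step, write a comment\n"
        "explaining the step, then the code for that step.\n"
    )

def decode_with_comments(model, problem: str) -> str:
    # `model` is any prompt -> completion callable (an assumption here).
    return model(comment_pivot_prompt(problem))
```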
arXiv Detail & Related papers (2024-04-11T08:30:46Z)
- CodeQueries: A Dataset of Semantic Queries over Code [7.0864879068510005]
We contribute a labeled dataset, called CodeQueries, of semantic queries over Python code.
Compared to existing datasets, the queries in CodeQueries concern code semantics, the context is file-level, and the answers are code spans.
We evaluate a large language model (GPT3.5-Turbo) in zero-shot and few-shot settings on a subset of CodeQueries.
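
A minimal sketch of the few-shot setup described: k exemplars of (query, code, answer span) are prepended to the test instance. The field names are illustrative assumptions, not the dataset's schema.

```python
def few_shot_prompt(exemplars: list[dict], query: str, code: str) -> str:
    # Zero-shot is the k=0 case: an empty exemplar list.
    shots = "\n\n".join(
        f"Query: {e['query']}\nCode:\n{e['code']}\nAnswer span: {e['span']}"
        for e in exemplars
    )
    return f"{shots}\n\nQuery: {query}\nCode:\n{code}\nAnswer span:"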
arXiv Detail & Related papers (2022-09-17T17:09:30Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of code with similar semantics.
We evaluate our approach on the code completion task in Python and Java, achieving state-of-the-art performance on the CodeXGLUE benchmark.
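
To illustrate the lexical-plus-semantic idea (a sketch under assumed components, not ReACC's retriever): blend a token-overlap score with embedding similarity; the 0.5/0.5 weighting and `embed` function are assumptions here.

```python
import numpy as np

def lexical_score(a: str, b: str) -> float:
    # Jaccard overlap on whitespace tokens, a stand-in for lexical copying.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def hybrid_retrieve(context: str, corpus: list[str], embed, k: int = 1):
    q = embed(context)
    def score(doc: str) -> float:
        semantic = float(np.dot(q, embed(doc)))
        return 0.5 * lexical_score(context, doc) + 0.5 * semantic
    return sorted(corpus, key=score, reverse=True)[:k]
```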
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
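
A bimodal text-code objective of this kind is typically an InfoNCE-style contrastive loss; here is a standard PyTorch formulation (not CodeRetriever's released code) that pulls each code embedding toward its paired text and away from other texts in the batch.

```python
import torch
import torch.nn.functional as F

def info_nce(code_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    # Normalize so the dot product is cosine similarity.
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.T / temperature   # (batch, batch)
    labels = torch.arange(len(code_emb))           # diagonal = positive pairs
    return F.cross_entropy(logits, labels)
```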
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
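
The "where-the-value-comes-from" relation in miniature: for straight-line Python assignments, link each variable use to its most recent definition. Real data-flow graphs (as used in GraphCodeBERT's pre-training) cover far more; this only illustrates the relation being encoded.

```python
import ast

def value_comes_from(source: str) -> list[tuple[str, str]]:
    edges, last_def = [], {}
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign):
            # Each variable read in the RHS gets an edge to its last definition.
            for n in ast.walk(stmt.value):
                if isinstance(n, ast.Name) and n.id in last_def:
                    edges.append((n.id, last_def[n.id]))
            for t in stmt.targets:
                if isinstance(t, ast.Name):
                    last_def[t.id] = ast.unparse(stmt.value)
    return edges

print(value_comes_from("x = 1\ny = x + 2\nz = y * x"))
# [('x', '1'), ('y', 'x + 2'), ('x', '1')]
```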
arXiv Detail & Related papers (2020-09-17T15:25:56Z)