NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification
- URL: http://arxiv.org/abs/2402.16910v2
- Date: Fri, 24 May 2024 07:11:17 GMT
- Title: NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification
- Authors: Hanna Abi Akl
- Abstract summary: We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language.
Our best model, a Neural Network, achieves a Macro-F1 score of 91.412% with an increase of 1.033% after data augmentation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language. We also show how generating controlled synthetic data using this workflow fixes some of the notable weaknesses of LLM-based generation and increases the performance of classical machine learning models on the code comment classification task. Our best model, a Neural Network, achieves a Macro-F1 score of 91.412% with an increase of 1.033% after data augmentation.
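As a rough illustration of the workflow the abstract describes, the sketch below pairs a stand-in LLM call with a toy symbolic rule filter before training a classical classifier. The paper's actual symbolic learning technique, prompts, labels, and model configuration are not given here, so every name (`llm_generate`, the rule set, the MLP settings) is an assumption:

```python
# Hedged sketch of the NeSy augmentation loop: an LLM proposes synthetic
# comments, symbolic rules accept or reject them, and the augmented data
# trains a classical model. All specifics are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def symbolic_filter(comment: str, label: str) -> bool:
    """Toy symbolic rule: keep a synthetic sample only if it satisfies
    simple label-specific lexical constraints (a stand-in for the paper's
    symbolic-based learning technique)."""
    rules = {"useful": lambda c: any(t in c for t in ("param", "return", "fix")),
             "not_useful": lambda c: len(c.split()) < 4}
    return rules.get(label, lambda c: True)(comment)

def augment(seed, llm_generate):
    """Generate candidate comments with an LLM; keep only rule-compliant ones."""
    synthetic = []
    for comment, label in seed:
        candidate = llm_generate(f"Write a C code comment similar to: {comment}")
        if symbolic_filter(candidate, label):
            synthetic.append((candidate, label))
    return seed + synthetic

def train(data):
    """Fit a classical classifier (here an MLP over TF-IDF features)."""
    texts, labels = zip(*data)
    model = make_pipeline(TfidfVectorizer(), MLPClassifier(max_iter=500))
    model.fit(texts, labels)
    return model
```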
Related papers
- Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs [1.8838588087156363]
This paper investigates the fine-tuning of code-generating Large Language Models (LLMs).
We enhance the training data for the reward model with the help of symbolic execution techniques.
Our reward models, fine-tuned on this dataset, demonstrate significant improvements over the baseline, CodeRL.
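A loose sketch of how such symbolic-execution-derived checks might label reward-model training data follows; real symbolic execution (e.g., via a tool such as KLEE) is replaced here by brute-force input enumeration, purely for illustration:

```python
# Hedged sketch: assigning reward labels to code candidates via a behavioral
# equivalence check. Exhaustive enumeration stands in for symbolic execution.
from itertools import product

def behaviorally_equivalent(f, g, arg_ranges):
    """Stand-in for a symbolic-execution equivalence check: compare two
    candidate functions over an enumerated input grid."""
    for args in product(*arg_ranges):
        try:
            if f(*args) != g(*args):
                return False
        except Exception:
            return False
    return True

def reward_label(reference, candidate, arg_ranges):
    """Binary reward label: 1.0 iff the candidate matches the reference
    behavior on every probed input."""
    return 1.0 if behaviorally_equivalent(reference, candidate, arg_ranges) else 0.0

# Example: a correct and a buggy re-implementation of abs().
print(reward_label(abs, lambda x: x if x > 0 else -x, [range(-5, 6)]))  # 1.0
print(reward_label(abs, lambda x: x, [range(-5, 6)]))                   # 0.0
```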
arXiv Detail & Related papers (2025-04-21T16:29:07Z)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
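The summary above suggests a generate-validate-fix loop. One minimal rendering, with hypothetical `llm` and `gen_tests` callables and an assumed retry budget, might look like:

```python
# Sketch of a unit-test-guided synthesis loop in the spirit of UnitCoder;
# the LLM calls and fix budget are illustrative assumptions.
def run_tests(code: str, tests: str) -> bool:
    """Execute candidate code plus its unit tests in a throwaway namespace."""
    try:
        ns = {}
        exec(code, ns)   # caution: sandbox untrusted code in a real pipeline
        exec(tests, ns)
        return True
    except Exception:
        return False

def synthesize(task: str, llm, gen_tests, max_fixes: int = 3):
    """Generate code, validate it against model-generated unit tests, and
    iteratively ask the model to fix failures; discard never-passing samples."""
    tests = gen_tests(task)
    code = llm(f"Write Python code for: {task}")
    for _ in range(max_fixes):
        if run_tests(code, tests):
            return code               # keep only verified code
        code = llm(f"Fix this code so these tests pass:\n{code}\n{tests}")
    return None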
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [59.60208063956459]
Large Language Models (LLMs) require high quality instruction data for effective alignment.
We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high quality coding instructions.
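A loose sketch of such an evolutionary synthesis loop is below; in the paper the genetic operators are performed by LLM agents, which are abstracted here as the hypothetical callables `mutate`, `crossover`, and `fitness`:

```python
# Schematic evolutionary loop for instruction synthesis; all operator
# callables are stand-ins for LLM-driven steps.
import random

def genetic_instruct(seed_instructions, mutate, crossover, fitness,
                     generations=5, sample_size=100):
    """Grow an instruction pool by mutation and crossover, keeping only
    offspring that pass a quality (fitness) check."""
    population = list(seed_instructions)
    for _ in range(generations):
        parents = random.sample(population, k=min(len(population), sample_size))
        children = [mutate(p) for p in parents]
        children += [crossover(a, b) for a, b in zip(parents, parents[1:])]
        population += [c for c in children if fitness(c)]
    return population
```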
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI), originally crafted for classification tasks, to code completion models.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
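One common membership-inference signal for generative models is low perplexity on memorized samples; the sketch below illustrates that idea, with `model.token_logprob` as a hypothetical API and the shadow-model threshold calibration of CodeMI abstracted away:

```python
# Illustrative membership-inference signal: memorized training samples tend
# to receive lower perplexity. The model API and threshold are assumptions.
def avg_negative_log_likelihood(model, tokens):
    """Mean NLL of a token sequence; `model.token_logprob(prefix, token)` is
    a hypothetical API returning log p(token | prefix)."""
    nll = [-model.token_logprob(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
    return sum(nll) / len(nll)

def likely_training_member(model, code_tokens, threshold=2.0):
    """Flag the sample as probable training data when its perplexity is low;
    in practice the threshold would be calibrated on shadow models."""
    return avg_negative_log_likelihood(model, code_tokens) < threshold
```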
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for aligning LLMs with downstream instruction distributions.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
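A schematic rendering of this encode-decode idea follows; every callable (`llm`, `target_llm`, `judge`) is a hypothetical stand-in, and Self-Rubrics plus Contrastive Filtering are reduced to a single numeric quality-gap check:

```python
# Sketch of the CodecLM encode-decode loop under simplifying assumptions.
def encode(seed_instruction: str, llm) -> list[str]:
    """Compress a seed instruction into on-the-fly metadata keywords."""
    reply = llm(f"List the use case and required skills for: {seed_instruction}")
    return [kw.strip() for kw in reply.split(",")]

def decode(metadata: list[str], llm, target_llm, judge):
    """Generate a tailored instruction, keeping it only when the target
    model's answer lags a strong model's (contrastive filtering)."""
    instruction = llm(f"Write a hard instruction involving: {', '.join(metadata)}")
    gap = judge(instruction, llm(instruction), target_llm(instruction))
    return instruction if gap > 0 else None   # keep only informative samples
```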
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
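One way to realize the augment-then-filter recipe is to generate a comment per snippet and keep only pairs whose code and comment embeddings align; the embedding model and threshold below are illustrative choices, not the paper's setup:

```python
# Sketch of comment augmentation with correlation-based filtering.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def augment_and_filter(snippets, llm, threshold=0.4):
    """Generate a comment per code snippet, keep only well-aligned pairs."""
    kept = []
    for code in snippets:
        comment = llm(f"Write a one-line comment describing:\n{code}")
        code_vec, com_vec = embedder.encode([code, comment], convert_to_tensor=True)
        if util.cos_sim(code_vec, com_vec).item() >= threshold:
            kept.append((code, comment))   # drop weakly correlated pairs
    return kept
```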
arXiv Detail & Related papers (2024-02-20T13:56:38Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
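A schematic version of such a cleaning pipeline: each pass asks an LLM to rewrite the program toward one readability principle, with an optional functional check guarding against behavior changes. The passes, prompts, and check are assumptions, not the paper's exact pipeline:

```python
# Sketch of an LLM-assisted code-cleaning pipeline under stated assumptions.
CLEANING_PASSES = [
    "Rename variables to be descriptive.",
    "Factor repeated logic into helper functions.",
    "Add a docstring summarizing the algorithm.",
]

def clean_program(code: str, llm, passes=CLEANING_PASSES, behaves_same=None):
    """Apply LLM rewriting passes, accepting each rewrite only if an optional
    functional check (e.g., the original test suite) still passes."""
    for instruction in passes:
        candidate = llm(f"{instruction}\nReturn only the code:\n{code}")
        if behaves_same is None or behaves_same(candidate):
            code = candidate
    return code
```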
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- A study of the impact of generative AI-based data augmentation on software metadata classification [1.1356542363919058]
We train a machine learning-based model on the neural contextual representations of comments and their corresponding code to predict the usefulness of code-comment pairs.
In the official assessment, our system achieves a 4% increase in F1-score over the baseline; the quality of the generated data is assessed as well.
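One way to obtain such contextual representations is to mean-pool encoder embeddings of the comment-code pair and fit a simple classifier on top; the model choice (CodeBERT) and pooling below are assumptions, not the paper's exact configuration:

```python
# Sketch: pooled CodeBERT pair embeddings feeding a usefulness classifier.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(comment: str, code: str) -> torch.Tensor:
    """Mean-pooled contextual embedding of a (comment, code) pair."""
    inputs = tok(comment, code, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**inputs).last_hidden_state
    return out.mean(dim=1).squeeze(0)

def train_usefulness(pairs, labels):
    """pairs: [(comment, code), ...]; labels: usefulness tags."""
    X = torch.stack([embed(c, k) for c, k in pairs]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)
```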
arXiv Detail & Related papers (2023-10-14T10:47:10Z)
- A ML-LLM pairing for better code comment classification [0.0]
We answer the code comment classification shared task challenge by providing a two-fold evaluation.
Our best model, which took second place in the shared task, is a Neural Network with a Macro-F1 score of 88.401% on the provided seed data.
arXiv Detail & Related papers (2023-10-13T12:43:13Z)
- Assemble Foundation Models for Automatic Code Summarization [9.53949558569201]
We propose a flexible and robust approach for automatic code summarization based on neural networks.
We assemble available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo.
We introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning.
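A minimal illustration of assembling pretrained encoder and decoder models for summarization is sketched below; the exact wiring in AdaMo, and its continuous-pretraining and intermediate-finetuning schedules, are not shown. Note the cross-attention weights are freshly initialized, so the assembled model must be fine-tuned before its summaries are meaningful:

```python
# Sketch: pairing CodeBERT (encoder) with GPT-2 (decoder) via transformers.
from transformers import AutoTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "gpt2")       # cross-attention starts untrained
enc_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
dec_tok = AutoTokenizer.from_pretrained("gpt2")
model.config.decoder_start_token_id = dec_tok.bos_token_id
model.config.pad_token_id = dec_tok.eos_token_id

def summarize(code: str) -> str:
    """Generate a (post-finetuning) natural-language summary for code."""
    ids = enc_tok(code, return_tensors="pt", truncation=True).input_ids
    out = model.generate(ids, max_new_tokens=30)
    return dec_tok.decode(out[0], skip_special_tokens=True)
```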
arXiv Detail & Related papers (2022-01-13T21:38:33Z)
- Self-Supervised Class Incremental Learning [51.62542103481908]
Existing Class Incremental Learning (CIL) methods are based on a supervised classification framework sensitive to data labels.
When updating them based on new class data, they suffer from catastrophic forgetting: the model cannot clearly distinguish old class data from the new.
In this paper, we explore the performance of Self-Supervised representation learning in Class Incremental Learning (SSCIL) for the first time.
arXiv Detail & Related papers (2021-11-18T06:58:19Z)
- Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning [134.77207192945053]
Prior methods learn the neural-symbolic models using reinforcement learning approaches.
We introduce the grammar model as a symbolic prior to bridge neural perception and symbolic reasoning.
We propose a novel back-search algorithm which mimics the top-down human-like learning procedure to propagate the error.
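A toy rendering of the back-search idea follows: when the predicted symbol sequence fails the symbolic check, search top-down for a minimal correction and use it as a pseudo-label for the perception module. The actual algorithm is more involved; this simplifies it to a single-symbol correction search over a toy grammar:

```python
# Toy back-search: find a one-symbol fix that satisfies the symbolic check.
def back_search(pred, vocab, is_valid):
    """Return `pred` if already valid, else the first sequence differing in
    one position that passes the grammar check, else None."""
    if is_valid(pred):
        return pred
    for i in range(len(pred)):                 # walk the sequence top-down
        for sym in vocab:
            fixed = pred[:i] + [sym] + pred[i + 1:]
            if is_valid(fixed):
                return fixed                   # pseudo-label for training
    return None

# Example toy grammar: sequences of the form <digit> '+' <digit>.
valid = lambda s: len(s) == 3 and s[0].isdigit() and s[1] == "+" and s[2].isdigit()
print(back_search(["1", "+", "+"], ["0", "1", "2", "+"], valid))  # ['1', '+', '0']
```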
arXiv Detail & Related papers (2020-06-11T17:42:49Z)