Enhancing Binary Code Comment Quality Classification: Integrating
Generative AI for Improved Accuracy
- URL: http://arxiv.org/abs/2310.11467v1
- Date: Sat, 14 Oct 2023 18:19:06 GMT
- Title: Enhancing Binary Code Comment Quality Classification: Integrating
Generative AI for Improved Accuracy
- Authors: Rohith Arumugam S, Angel Deborah S
- Abstract summary: This report focuses on enhancing a binary code comment quality classification model by integrating generated code and comment pairs.
The dataset comprises 9048 pairs of code and comments written in the C programming language, each annotated as "Useful" or "Not Useful."
The outcome of this effort consists of two classification models: one trained on the original dataset and another trained on the augmented dataset that incorporates the newly generated code-comment pairs and labels.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report focuses on enhancing a binary code comment quality classification
model by integrating generated code and comment pairs to improve model
accuracy. The dataset comprises 9048 pairs of code and comments written in the
C programming language, each annotated as "Useful" or "Not Useful."
Additionally, code and comment pairs are generated using a Large Language Model
Architecture, and these generated pairs are labeled to indicate their utility.
The outcome of this effort consists of two classification models: one utilizing
the original dataset and another incorporating the augmented dataset with the
newly generated code-comment pairs and labels.
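To make the described pipeline concrete, below is a minimal sketch of the two-model comparison: one classifier trained on the original 9048 annotated C code-comment pairs, and one trained on the dataset augmented with LLM-generated, labeled pairs. The file names, column names, and the TF-IDF plus logistic-regression classifier are illustrative assumptions; the report does not specify its preprocessing or model choice.

```python
# Sketch: compare a classifier trained on the original code-comment pairs
# against one trained on the augmented set (original + LLM-generated pairs).
# File names, column names, and the model choice are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_and_score(df: pd.DataFrame) -> float:
    # Join code and comment into one text field for vectorization.
    text = df["code"] + " [SEP] " + df["comment"]
    labels = (df["label"] == "Useful").astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        text, labels, test_size=0.2, random_state=42, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(max_features=20000),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


original = pd.read_csv("c_code_comments.csv")       # 9048 annotated pairs (hypothetical file)
generated = pd.read_csv("llm_generated_pairs.csv")  # LLM-generated, labeled pairs (hypothetical file)
augmented = pd.concat([original, generated], ignore_index=True)

print("original accuracy: ", train_and_score(original))
print("augmented accuracy:", train_and_score(augmented))
```

Comparing the two accuracy scores reproduces, at a sketch level, the original-versus-augmented evaluation the report describes.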
Related papers
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful."
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation [5.2510537676167335]
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages.
Our evaluations show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet.
arXiv Detail & Related papers (2023-05-09T09:35:03Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition [0.7894331610810762]
We present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token.
We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics.
arXiv Detail & Related papers (2022-04-05T20:53:06Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines the unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)