A study of the impact of generative AI-based data augmentation on
software metadata classification
- URL: http://arxiv.org/abs/2310.13714v1
- Date: Sat, 14 Oct 2023 10:47:10 GMT
- Title: A study of the impact of generative AI-based data augmentation on
software metadata classification
- Authors: Tripti Kumari, Chakali Sai Charan and Ayan Das
- Abstract summary: We train a machine learning-based model using the neural contextual representations of comments and their corresponding code to predict the usefulness of code-comment pairs.
In the official assessment, our system achieves a 4% increase in F1-score over the baseline, along with an assessment of the quality of the generated data.
- Score: 1.1356542363919058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the system submitted by the team from IIT(ISM) Dhanbad to
FIRE IRSE 2023 shared task 1 on the automatic usefulness prediction of
code-comment pairs, together with a study of the impact of Large Language
Model (LLM)-generated data on the original base data for an associated source
code. We have developed a framework in which we train a machine learning-based
model using the neural contextual representations of the comments and their
corresponding code to predict the usefulness of code-comment pairs, and we
analyze performance when LLM-generated data is combined with the base data. In
the official assessment, our system achieves a 4% increase in F1-score over
the baseline, along with an assessment of the quality of the generated data.
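The pipeline described in the abstract can be sketched roughly as follows. The paper trains an ML model on neural contextual representations of comment-code pairs; in this self-contained sketch a bag-of-words vector and a nearest-centroid classifier stand in for the neural encoder and the learned model, and all pairs and labels are invented for illustration.

```python
# Minimal stand-in for the usefulness-prediction pipeline: embed each
# (comment, code) pair, then classify by nearest class centroid.
from collections import Counter
import math

def embed(comment: str, code: str) -> Counter:
    # Stand-in for a neural contextual representation of the pair.
    return Counter((comment + " " + code).lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical (comment, code, label) triples: 1 = useful, 0 = not useful.
train = [
    ("increments the retry counter on failure", "retries += 1", 1),
    ("TODO", "return compute(x)", 0),
    ("returns the parsed config as a dict", "return json.loads(raw)", 1),
    ("fix later", "total = sum(values)", 0),
]

# "Training": accumulate the vectors of each class into a centroid.
centroids = {}
for label in (0, 1):
    centroid = Counter()
    for comment, code, l in train:
        if l == label:
            centroid.update(embed(comment, code))
    centroids[label] = centroid

def predict(comment: str, code: str) -> int:
    v = embed(comment, code)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))
```

A real system would replace `embed` with a pretrained code/language model and the centroid rule with a trained classifier; augmenting `train` with LLM-generated pairs is the experiment the paper reports on.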
Related papers
- DACO: Towards Application-Driven and Comprehensive Data Analysis via
Code Generation [86.4326416303723]
Data analysis is a crucial process for generating in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z) - NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification [0.0]
We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language.
Our best model, a Neural Network, achieves a Macro-F1 score of 91.412% with an increase of 1.033% after data augmentation.
arXiv Detail & Related papers (2024-02-25T13:20:13Z) - Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
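The augment-then-filter idea above can be sketched in a few lines: pair each code snippet with a generated comment, then keep only the pairs whose comment correlates with the code above a threshold. Here `difflib`'s character-level similarity is a crude stand-in for a learned natural-language/code relevance model, and the candidate pairs and threshold are invented for illustration.

```python
# Filter out augmentation candidates whose comment is poorly correlated
# with the code, keeping only plausible (comment, code) training pairs.
from difflib import SequenceMatcher

def correlation(comment: str, code: str) -> float:
    # Stand-in for a learned comment-code relevance score in [0, 1].
    return SequenceMatcher(None, comment.lower(), code.lower()).ratio()

# Hypothetical LLM-generated (comment, code) candidates.
candidates = [
    ("increment the counter", "counter += 1"),
    ("banana", "open(path).read()"),  # poorly correlated, should be dropped
]

THRESHOLD = 0.3  # arbitrary cut-off, for illustration only
kept = [(c, s) for c, s in candidates if correlation(c, s) >= THRESHOLD]
```

In practice the relevance score would come from a trained model rather than string similarity, but the shape of the filtering step is the same.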
arXiv Detail & Related papers (2024-02-20T13:56:38Z) - KAXAI: An Integrated Environment for Knowledge Analysis and Explainable
AI [0.0]
The paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation.
The system allows users to navigate and harness the power of machine learning while abstracting its complexities and providing high usability.
arXiv Detail & Related papers (2023-12-30T10:20:47Z) - Generative AI for Software Metadata: Overview of the Information
Retrieval in Software Engineering Track at FIRE 2023 [18.616716369775883]
The Information Retrieval in Software Engineering (IRSE) track aims to develop solutions for automated evaluation of code comments.
The dataset consists of 9048 code comments and surrounding code snippet pairs extracted from open source C based projects.
The labels generated from large language models increase the bias in the prediction model but lead to less over-fitted results.
arXiv Detail & Related papers (2023-10-27T14:13:23Z) - A ML-LLM pairing for better code comment classification [0.0]
We answer the code comment classification shared task challenge by providing a two-fold evaluation.
Our best model, which took second place in the shared task, is a Neural Network with a Macro-F1 score of 88.401% on the provided seed data.
arXiv Detail & Related papers (2023-10-13T12:43:13Z) - Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis [50.972595036856035]
We present code that successfully replicates results from six popular and recent graph recommendation models.
We compare these graph models with traditional collaborative filtering models that historically performed well in offline evaluations.
By investigating the information flow from users' neighborhoods, we aim to identify which models are influenced by intrinsic features in the dataset structure.
arXiv Detail & Related papers (2023-08-01T09:31:44Z) - Tackling Computational Heterogeneity in FL: A Few Theoretical Insights [68.8204255655161]
We introduce and analyse a novel aggregation framework that allows for formalizing and tackling computationally heterogeneous data.
The proposed aggregation algorithms are extensively analyzed from both a theoretical and an experimental perspective.
arXiv Detail & Related papers (2023-07-12T16:28:21Z) - Imputing Knowledge Tracing Data with Subject-Based Training via LSTM
Variational Autoencoders Frameworks [6.24828623162058]
We adopt a subject-based training method to split and impute data by student IDs instead of row number splitting.
We leverage two existing deep generative frameworks, namely Variational Autoencoders (VAE) and Longitudinal Variational Autoencoders (LVAE).
We demonstrate that the generated data from LSTM-VAE and LSTM-LVAE can boost the original model performance by about 50%.
arXiv Detail & Related papers (2023-02-24T21:56:03Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.