Bias and Error Mitigation in Software-Generated Data: An Advanced Search
and Optimization Framework Leveraging Generative Code Models
- URL: http://arxiv.org/abs/2310.11546v1
- Date: Tue, 17 Oct 2023 19:31:05 GMT
- Title: Bias and Error Mitigation in Software-Generated Data: An Advanced Search
and Optimization Framework Leveraging Generative Code Models
- Authors: Ernesto Giralt Hernández
- Abstract summary: This paper proposes an advanced search and optimization framework aimed at generating and choosing optimal source code capable of correcting errors and biases from previous versions.
Applying this framework multiple times on the same software system would incrementally improve the quality of the output results.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data generation and analysis is a fundamental aspect of many industries and
disciplines, from strategic decision making in business to research in the
physical and social sciences. However, data generated using software and
algorithms can be subject to biases and errors. These can be due to problems
with the original software, default settings that do not align with the
specific needs of the situation, or even deeper problems with the underlying
theories and models. This paper proposes an advanced search and optimization
framework aimed at generating and choosing optimal source code capable of
correcting errors and biases from previous versions, in order to address
typical problems in software systems that specialize in data analysis and
generation, especially in the corporate and data science world. Applying this framework multiple
times on the same software system would incrementally improve the quality of
the output results. It uses Solomonoff Induction as a sound theoretical basis,
extending it with Kolmogorov Conditional Complexity, a novel adaptation, to
evaluate a set of candidate programs. We propose the use of generative models
for the creation of this set of programs, with special emphasis on the
capabilities of Large Language Models (LLMs) to generate high quality code.
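The paper itself includes no code, but the core scoring idea — preferring candidate programs with low Kolmogorov conditional complexity given the previous version — can be approximated with a standard compression-based proxy from the normalized compression distance (NCD) literature. The sketch below is an illustration of that general idea, not the paper's actual method; all function names are hypothetical.

```python
import zlib


def compressed_len(data: bytes) -> int:
    # Length of the zlib-compressed representation: a crude,
    # computable upper bound on Kolmogorov complexity.
    return len(zlib.compress(data, 9))


def conditional_complexity(candidate: str, context: str) -> int:
    # Approximate K(candidate | context) as
    # C(context + candidate) - C(context), the usual
    # compression-based surrogate used in NCD-style methods.
    c = context.encode()
    return compressed_len(c + candidate.encode()) - compressed_len(c)


def rank_candidates(candidates: list[str], context: str) -> list[str]:
    # Prefer candidate programs whose estimated conditional
    # complexity given the previous program version is lowest.
    return sorted(candidates, key=lambda s: conditional_complexity(s, context))
```

A candidate that largely reuses the structure of the previous version compresses well against it and therefore ranks ahead of an unrelated candidate of similar length; actual Kolmogorov complexity is uncomputable, so any practical instantiation must rely on a surrogate of this kind.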
Related papers
- Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points [51.40935517552926]
We introduce Focused-DPO, a framework that enhances code generation by directing preference optimization towards critical error-prone areas.
By focusing on error-prone points, Focused-DPO advances the accuracy and functionality of model-generated code.
arXiv Detail & Related papers (2025-02-17T06:16:02Z)
- Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights [0.0]
This paper introduces a novel scoring mechanism called the SBC score.
It is based on a reverse generation technique that leverages the natural language generation capabilities of Large Language Models.
Unlike direct code analysis, our approach reconstructs system requirements from AI-generated code and compares them with the original specifications.
arXiv Detail & Related papers (2025-02-11T01:12:11Z) - An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities [19.455889970335967]
Code generation aims to automatically generate code snippets in a specific programming language from natural language descriptions.
One main challenge of pre-trained models for code generation is the semantic gap between natural language requirements and source code.
A retrieval-augmented framework can be leveraged to help understand the requirements and provide guidance for the generation process.
arXiv Detail & Related papers (2025-01-23T15:17:51Z) - SDPERL: A Framework for Software Defect Prediction Using Ensemble Feature Extraction and Reinforcement Learning [0.0]
This paper proposes an innovative framework for software defect prediction.
It combines ensemble feature extraction with reinforcement learning (RL)-based feature selection.
This work is among the first recent efforts to address this challenge at file-level granularity.
arXiv Detail & Related papers (2024-12-10T21:16:05Z) - SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Investigating Reproducibility in Deep Learning-Based Software Fault
Prediction [16.25827159504845]
With the rapid adoption of increasingly complex machine learning models, it becomes more and more difficult for scholars to reproduce the results that are reported in the literature.
This is in particular the case when the applied deep learning models and the evaluation methodology are not properly documented and when code and data are not shared.
We have conducted a systematic review of the current literature and examined the level of reproducibility of 56 research articles that were published between 2019 and 2022 in top-tier software engineering conferences.
arXiv Detail & Related papers (2024-02-08T13:00:18Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Applications of Nature-Inspired Metaheuristic Algorithms for Tackling Optimization Problems Across Disciplines [12.664160352147293]
This paper demonstrates the usefulness of nature-inspired metaheuristic algorithms for solving a variety of challenging optimization problems in statistics.
The main goal of this paper is to show that a typical metaheuristic algorithm, like CSO-MA, is efficient for tackling many different types of optimization problems in statistics.
arXiv Detail & Related papers (2023-08-08T16:41:33Z) - PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z) - SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in
Fine-tuned Source Code Models [58.78043959556283]
We study the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods.
Our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
arXiv Detail & Related papers (2022-10-10T16:07:24Z) - CodeRL: Mastering Code Generation through Pretrained Models and Deep
Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.