Bias and Error Mitigation in Software-Generated Data: An Advanced Search
and Optimization Framework Leveraging Generative Code Models
- URL: http://arxiv.org/abs/2310.11546v1
- Date: Tue, 17 Oct 2023 19:31:05 GMT
- Title: Bias and Error Mitigation in Software-Generated Data: An Advanced Search
and Optimization Framework Leveraging Generative Code Models
- Authors: Ernesto Giralt Hernández
- Abstract summary: This paper proposes an advanced search and optimization framework aimed at generating and choosing optimal source code capable of correcting errors and biases from previous versions.
Applying this framework multiple times on the same software system would incrementally improve the quality of the output results.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data generation and analysis is a fundamental aspect of many industries and
disciplines, from strategic decision making in business to research in the
physical and social sciences. However, data generated using software and
algorithms can be subject to biases and errors. These can be due to problems
with the original software, default settings that do not align with the
specific needs of the situation, or even deeper problems with the underlying
theories and models. This paper proposes an advanced search and optimization
framework aimed at generating and choosing optimal source code capable of
correcting errors and biases from previous versions to address typical problems
in software systems specializing in data analysis and generation, especially
those in the corporate and data science world. Applying this framework multiple
times on the same software system would incrementally improve the quality of
the output results. It uses Solomonoff Induction as a sound theoretical basis,
extending it with Kolmogorov Conditional Complexity, a novel adaptation, to
evaluate a set of candidate programs. We propose the use of generative models
for the creation of this set of programs, with special emphasis on the
capabilities of Large Language Models (LLMs) to generate high quality code.
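The candidate-evaluation idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's method: Kolmogorov complexity is uncomputable, so the sketch assumes zlib-compressed size as a crude stand-in, approximating the conditional complexity of a candidate given the previous version as C(previous + candidate) - C(previous), measured in bytes. The names `conditional_complexity`, `rank_candidates`, and `is_consistent` are hypothetical, and the consistency check (for example, a test suite the corrected program must pass) is supplied by the caller.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))


def conditional_complexity(candidate: str, previous: str) -> int:
    """Approximate K(candidate | previous) as C(previous + candidate) - C(previous).

    Assumption: compression stands in for the uncomputable Kolmogorov complexity.
    """
    prev = previous.encode("utf-8")
    cand = candidate.encode("utf-8")
    return max(compressed_size(prev + cand) - compressed_size(prev), 0)


def rank_candidates(candidates, previous, is_consistent):
    """Return (candidate, normalized weight) pairs, highest weight first.

    Candidates rejected by `is_consistent` are dropped. The rest receive a
    Solomonoff-style weight 2^(-cost), normalized so the weights sum to 1.
    """
    costs = {
        c: conditional_complexity(c, previous)
        for c in candidates
        if is_consistent(c)
    }
    if not costs:
        return []
    # Subtract the smallest cost so the best candidate has raw weight 1.0;
    # this cancels in the normalization and keeps the leading terms from underflowing.
    base = min(costs.values())
    raw = {c: 2.0 ** -(cost - base) for c, cost in costs.items()}
    total = sum(raw.values())
    return sorted(
        ((c, w / total) for c, w in raw.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

In this sketch, a candidate that reuses most of the previous version tends to compress well against it and so receives a higher weight, while the consistency check filters out candidates that do not correct the observed errors. Calling `rank_candidates` again with the selected candidate as the new `previous` mirrors the repeated application described in the abstract.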
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Investigating Reproducibility in Deep Learning-Based Software Fault Prediction [16.25827159504845]
With the rapid adoption of increasingly complex machine learning models, it becomes more and more difficult for scholars to reproduce the results that are reported in the literature.
This is in particular the case when the applied deep learning models and the evaluation methodology are not properly documented and when code and data are not shared.
We have conducted a systematic review of the current literature and examined the level of reproducibility of 56 research articles that were published between 2019 and 2022 in top-tier software engineering conferences.
arXiv Detail & Related papers (2024-02-08T13:00:18Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Applications of Nature-Inspired Metaheuristic Algorithms for Tackling Optimization Problems Across Disciplines [12.664160352147293]
This paper demonstrates the usefulness of nature-inspired metaheuristic algorithms for solving a variety of challenging optimization problems in statistics.
The main goal of this paper is to show that a typical metaheuristic algorithm, like CSO-MA, is efficient for tackling many different types of optimization problems in statistics.
arXiv Detail & Related papers (2023-08-08T16:41:33Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models [58.78043959556283]
We study the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods.
Our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
arXiv Detail & Related papers (2022-10-10T16:07:24Z)
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
- Detecting Requirements Smells With Deep Learning: Experiences, Challenges and Future Work [9.44316959798363]
This work aims to improve the previous work by creating a manually labeled dataset and using ensemble learning, Deep Learning (DL), and techniques such as word embeddings and transfer learning to overcome the generalization problem.
The current findings show that the dataset is unbalanced and indicate which classes need more examples.
arXiv Detail & Related papers (2021-08-06T12:45:15Z)
- Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation [101.22379613810881]
We consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points.
This problem setting emerges in many domains where function evaluation is a complex and expensive process.
We propose a tractable approximation that allows us to scale our method to high-capacity neural network models.
arXiv Detail & Related papers (2021-02-16T06:04:27Z)
- Software Defect Prediction Based On Deep Learning Models: Performance Study [0.5735035463793008]
Two deep learning models, Stack Sparse Auto-Encoder (SSAE) and Deep Belief Network (DBN), are deployed to classify NASA datasets.
According to the conducted experiment, the accuracy for the datasets with sufficient samples is enhanced.
arXiv Detail & Related papers (2020-04-02T06:02:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.