Related papers: Bias and Error Mitigation in Software-Generated Data: An Advanced Search and Optimization Framework Leveraging Generative Code Models

Bias and Error Mitigation in Software-Generated Data: An Advanced Search and Optimization Framework Leveraging Generative Code Models

URL: http://arxiv.org/abs/2310.11546v1
Date: Tue, 17 Oct 2023 19:31:05 GMT
Title: Bias and Error Mitigation in Software-Generated Data: An Advanced Search and Optimization Framework Leveraging Generative Code Models
Authors: Ernesto Giralt Hern\'andez
Abstract summary: This paper proposes an advanced search and optimization framework aimed at generating and choosing optimal source code capable of correcting errors and biases from previous versions. Applying this framework multiple times on the same software system would incrementally improve the quality of the output results.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data generation and analysis is a fundamental aspect of many industries and disciplines, from strategic decision making in business to research in the physical and social sciences. However, data generated using software and algorithms can be subject to biases and errors. These can be due to problems with the original software, default settings that do not align with the specific needs of the situation, or even deeper problems with the underlying theories and models. This paper proposes an advanced search and optimization framework aimed at generating and choosing optimal source code capable of correcting errors and biases from previous versions to address typical problems in software systems specializing in data analysis and generation, especially those in the corporate and data science world. Applying this framework multiple times on the same software system would incrementally improve the quality of the output results. It uses Solomonoff Induction as a sound theoretical basis, extending it with Kolmogorov Conditional Complexity, a novel adaptation, to evaluate a set of candidate programs. We propose the use of generative models for the creation of this set of programs, with special emphasis on the capabilities of Large Language Models (LLMs) to generate high quality code.

Related papers

From Bugs to Benchmarks: A Comprehensive Survey of Software Defect Datasets [19.140541190998842]
Software defect datasets are collections of software bugs and their associated information. Over the years, numerous software defect datasets have been developed, providing rich resources for the community. This article provides a comprehensive survey of 132 software defect datasets.
arXiv Detail & Related papers (2025-04-24T23:07:04Z)
Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks [12.242406234298443]
We develop a taxonomy of potentially harmful software engineering scenarios. We then create a dataset of prompts based on this taxonomy. We investigate the impact of models size, architecture family, and alignment strategies on their tendency to generate harmful content.
arXiv Detail & Related papers (2025-04-02T16:00:14Z)
Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points [51.40935517552926]
We introduce Focused-DPO, a framework that enhances code generation by directing preference optimization towards critical error-prone areas. By focusing on error-prone points, Focused-DPO advances the accuracy and functionality of model-generated code.
arXiv Detail & Related papers (2025-02-17T06:16:02Z)
An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities [19.455889970335967]
Code generation aims to automatically generate code snippets of specific programming language according to natural language descriptions. One main challenge of pre-trained models for code generation is the semantic gap between natural language requirements and source code. Retrieval-augmented framework can be leveraged to help understand the requirements and provide guidance for the generation process.
arXiv Detail & Related papers (2025-01-23T15:17:51Z)
SDPERL: A Framework for Software Defect Prediction Using Ensemble Feature Extraction and Reinforcement Learning [0.0]
This paper proposes an innovative framework for software defect prediction. It combines ensemble feature extraction with reinforcement learning (RL)--based feature selection. We claim that this work is among the first in recent efforts to address this challenge at the file-level granularity.
arXiv Detail & Related papers (2024-12-10T21:16:05Z)
Chain-of-Programming (CoP) : Empowering Large Language Models for Geospatial Code Generation [2.6026969939746705]
This paper proposes a Chain of Programming framework to decompose the code generation process into five steps. The framework incorporates a shared information pool, knowledge base retrieval, and user feedback mechanisms. It significantly improves the logical clarity, syntactical correctness, and executability of the generated code.
arXiv Detail & Related papers (2024-11-16T09:20:35Z)
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling [15.673219028826173]
We introduce a semi-automated data synthesis framework designed for optimization modeling issues, named OR-Instruct. We train various open-source LLMs with a capacity of 7 billion parameters (dubbed ORLMs) The resulting model demonstrates significantly enhanced optimization modeling capabilities, achieving state-of-the-art performance across the NL4OPT, MAMO, and IndustryOR benchmarks.
arXiv Detail & Related papers (2024-05-28T01:55:35Z)
Investigating Reproducibility in Deep Learning-Based Software Fault Prediction [16.25827159504845]
With the rapid adoption of increasingly complex machine learning models, it becomes more and more difficult for scholars to reproduce the results that are reported in the literature. This is in particular the case when the applied deep learning models and the evaluation methodology are not properly documented and when code and data are not shared. We have conducted a systematic review of the current literature and examined the level of 56 research articles that were published between 2019 and 2022 in top-tier software engineering conferences.
arXiv Detail & Related papers (2024-02-08T13:00:18Z)
LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
Applications of Nature-Inspired Metaheuristic Algorithms for Tackling Optimization Problems Across Disciplines [12.664160352147293]
This paper demonstrates the usefulness of nature-inspired metaheuristic algorithms for solving a variety of challenging optimization problems in statistics. The main goal of this paper is to show a typical metaheuristic algorithmi, like CSO-MA, is efficient for tackling many different types of optimization problems in statistics.
arXiv Detail & Related papers (2023-08-08T16:41:33Z)
PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user. We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models [58.78043959556283]
We study the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods. Our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
arXiv Detail & Related papers (2022-10-10T16:07:24Z)
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning. During inference, we introduce a new generation procedure with a critical sampling strategy. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation [101.22379613810881]
We consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points. This problem setting emerges in many domains where function evaluation is a complex and expensive process. We propose a tractable approximation that allows us to scale our method to high-capacity neural network models.
arXiv Detail & Related papers (2021-02-16T06:04:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.