Related papers: OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport

OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport

URL: http://arxiv.org/abs/2403.02372v1
Date: Mon, 4 Mar 2024 18:23:55 GMT
Title: OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport
Authors: Alireza Pirhadi, Mohammad Hossein Moslemi, Alexander Cloninger, Mostafa Milani, Babak Salimi
Abstract summary: sys is a framework that harnesses optimal transport theory for data repair under Conditional Independence (CI) constraints. We develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data.
Score: 51.6416022358349
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce \sys, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.

Related papers

Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural optimization, enabling models learns that solve complex problems without requiring expert knowledge.<n>Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces.<n>We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
arXiv Detail & Related papers (2025-05-13T16:47:00Z)
Decentralized Inference for Spatial Data Using Low-Rank Models [4.168323530566095]
This paper presents a decentralized framework tailored for parameter inference in spatial low-rank models. A key obstacle arises from the spatial dependence among observations, which prevents the log-likelihood from being expressed as a summation. Our approach employs a block descent method integrated with multi-consensus and dynamic consensus averaging for effective parameter optimization.
arXiv Detail & Related papers (2025-02-01T04:17:01Z)
AdapFair: Ensuring Continuous Fairness for Machine Learning Operations [7.909259406397651]
We present a debiasing framework designed to find an optimal fair transformation of input data. We leverage the normalizing flows to enable efficient, information-preserving data transformation. We introduce an efficient optimization algorithm with closed-formed gradient computations.
arXiv Detail & Related papers (2024-09-23T15:01:47Z)
Integer Optimization of CT Trajectories using a Discrete Data Completeness Formulation [3.924235219960689]
X-ray computed tomography plays a key role in digitizing three-dimensional structures for a wide range of medical and industrial applications. Traditional CT systems often rely on standard circular and helical scan trajectories, which may not be optimal for challenging scenarios involving large objects, complex structures, or resource constraints. We are exploring the potential of twin robotic CT systems, which offer the flexibility to acquire projections from arbitrary views around the object of interest.
arXiv Detail & Related papers (2024-01-29T10:38:58Z)
Large-Scale OD Matrix Estimation with A Deep Learning Method [70.78575952309023]
The proposed method integrates deep learning and numerical optimization algorithms to infer matrix structure and guide numerical optimization. We conducted tests to demonstrate the good generalization performance of our method on a large-scale synthetic dataset.
arXiv Detail & Related papers (2023-10-09T14:30:06Z)
Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities. We recast inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses apparent intractability. The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z)
Efficient Learning of Decision-Making Models: A Penalty Block Coordinate Descent Algorithm for Data-Driven Inverse Optimization [12.610576072466895]
We consider the inverse problem where we use prior decision data to uncover the underlying decision-making process. This statistical learning problem is referred to as data-driven inverse optimization. We propose an efficient block coordinate descent-based algorithm to solve large problem instances.
arXiv Detail & Related papers (2022-10-27T12:52:56Z)
Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations [50.37808220291108]
This paper addresses learning safe output feedback control laws from partial observations of expert demonstrations. We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety. We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior.
arXiv Detail & Related papers (2021-11-18T23:21:00Z)
Outlier-Robust Sparse Estimation via Non-Convex Optimization [73.18654719887205]
We explore the connection between high-dimensional statistics and non-robust optimization in the presence of sparsity constraints. We develop novel and simple optimization formulations for these problems. As a corollary, we obtain that any first-order method that efficiently converges to station yields an efficient algorithm for these tasks.
arXiv Detail & Related papers (2021-09-23T17:38:24Z)
Constrained Model-Free Reinforcement Learning for Process Optimization [0.0]
Reinforcement learning (RL) is a control approach that can handle nonlinear optimal control problems. Despite the promise exhibited, RL has yet to see marked translation to industrial practice. We propose an 'oracle'-assisted constrained Q-learning algorithm that guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-11-16T13:16:22Z)
Combining Deep Learning and Optimization for Security-Constrained Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems. Modeling of APR within the SCOPF problem results in complex large-scale mixed-integer programs. This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.