OTClean: Data Cleaning for Conditional Independence Violations using
Optimal Transport
- URL: http://arxiv.org/abs/2403.02372v1
- Date: Mon, 4 Mar 2024 18:23:55 GMT
- Title: OTClean: Data Cleaning for Conditional Independence Violations using
Optimal Transport
- Authors: Alireza Pirhadi, Mohammad Hossein Moslemi, Alexander Cloninger,
Mostafa Milani, Babak Salimi
- Abstract summary: OTClean is a framework that harnesses optimal transport theory for data repair under Conditional Independence (CI) constraints.
We develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data.
- Score: 51.6416022358349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring Conditional Independence (CI) constraints is pivotal for the
development of fair and trustworthy machine learning models. In this paper, we
introduce OTClean, a framework that harnesses optimal transport theory for data
repair under CI constraints. Optimal transport theory provides a rigorous
framework for measuring the discrepancy between probability distributions,
thereby ensuring control over data utility. We formulate the data repair
problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and
propose an alternating method for its solution. However, this approach faces
scalability issues due to the computational cost associated with computing
optimal transport distances, such as the Wasserstein distance. To overcome
these scalability challenges, we reframe our problem as a regularized
optimization problem, enabling us to develop an iterative algorithm inspired by
Sinkhorn's matrix scaling algorithm, which efficiently addresses
high-dimensional and large-scale data. Through extensive experiments, we
demonstrate the efficacy and efficiency of our proposed methods, showcasing
their practical utility in real-world data cleaning and preprocessing tasks.
Furthermore, we provide comparisons with traditional approaches, highlighting
the superiority of our techniques in terms of preserving data utility while
ensuring adherence to the desired CI constraints.
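As a rough illustration of the matrix-scaling idea the abstract refers to, the sketch below runs the classical Sinkhorn iteration for entropy-regularized optimal transport between two small discrete distributions. It is not the OTClean algorithm itself, which additionally enforces CI constraints on the repaired data; the function name `sinkhorn`, the parameters `reg` and `n_iters`, and the toy inputs are all assumptions made for this example.

```python
import math

def sinkhorn(a, b, cost, reg=0.5, n_iters=500):
    """Classical Sinkhorn scaling (illustrative sketch, not the paper's method).

    a, b  : source and target histograms (positive entries, each summing to 1)
    cost  : cost[i][j] is the cost of moving mass from bin i of a to bin j of b
    reg   : entropic regularization strength (hypothetical default)
    Returns an approximate transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: K[i][j] = exp(-cost[i][j] / reg), strictly positive
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so that the plan's row sums match a
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the plan's column sums match b
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The plan is diag(u) * K * diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Total cost <plan, cost> of a transport plan."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))

# Toy example: three bins on a line, cost = distance between bin indices
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
cost = [[abs(i - j) for j in range(3)] for i in range(3)]
plan = sinkhorn(a, b, cost)
print(transport_cost(plan, cost))
```

Each iteration costs only a pair of matrix-vector products, which is the usual reason this scheme is preferred over solving the unregularized transport problem exactly. A smaller `reg` brings the plan closer to the unregularized optimum but slows convergence and can underflow the kernel.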
Related papers
- AdapFair: Ensuring Continuous Fairness for Machine Learning Operations [7.909259406397651]
We present a debiasing framework designed to find an optimal fair transformation of input data.
We leverage normalizing flows to enable efficient, information-preserving data transformation.
We introduce an efficient optimization algorithm with closed-form gradient computations.
arXiv Detail & Related papers (2024-09-23T15:01:47Z)
- Integer Optimization of CT Trajectories using a Discrete Data
Completeness Formulation [3.924235219960689]
X-ray computed tomography plays a key role in digitizing three-dimensional structures for a wide range of medical and industrial applications.
Traditional CT systems often rely on standard circular and helical scan trajectories, which may not be optimal for challenging scenarios involving large objects, complex structures, or resource constraints.
We are exploring the potential of twin robotic CT systems, which offer the flexibility to acquire projections from arbitrary views around the object of interest.
arXiv Detail & Related papers (2024-01-29T10:38:58Z)
- Large-Scale OD Matrix Estimation with A Deep Learning Method [70.78575952309023]
The proposed method integrates deep learning and numerical optimization algorithms to infer matrix structure and guide numerical optimization.
We conducted tests to demonstrate the good generalization performance of our method on a large-scale synthetic dataset.
arXiv Detail & Related papers (2023-10-09T14:30:06Z)
- Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities.
We recast inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses apparent intractability.
The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z)
- Efficient Learning of Decision-Making Models: A Penalty Block Coordinate
Descent Algorithm for Data-Driven Inverse Optimization [12.610576072466895]
We consider the inverse problem where we use prior decision data to uncover the underlying decision-making process.
This statistical learning problem is referred to as data-driven inverse optimization.
We propose an efficient block coordinate descent-based algorithm to solve large problem instances.
arXiv Detail & Related papers (2022-10-27T12:52:56Z)
- Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations [50.37808220291108]
This paper addresses learning safe output feedback control laws from partial observations of expert demonstrations.
We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety.
We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior.
arXiv Detail & Related papers (2021-11-18T23:21:00Z)
- Outlier-Robust Sparse Estimation via Non-Convex Optimization [73.18654719887205]
We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints.
We develop novel and simple optimization formulations for these problems.
As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks.
arXiv Detail & Related papers (2021-09-23T17:38:24Z)
- Constrained Model-Free Reinforcement Learning for Process Optimization [0.0]
Reinforcement learning (RL) is a control approach that can handle nonlinear optimal control problems.
Despite the promise exhibited, RL has yet to see marked translation to industrial practice.
We propose an 'oracle'-assisted constrained Q-learning algorithm that guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-11-16T13:16:22Z)
- Combining Deep Learning and Optimization for Security-Constrained
Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of the automatic primary response (APR) of generators within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.