Related papers: Iterative Rule Extension for Logic Analysis of Data: an MILP-based heuristic to derive interpretable binary classification from large datasets

URL: http://arxiv.org/abs/2110.13664v1
Date: Mon, 25 Oct 2021 13:31:30 GMT
Title: Iterative Rule Extension for Logic Analysis of Data: an MILP-based heuristic to derive interpretable binary classification from large datasets
Authors: Marleen Balvert
Abstract summary: This work presents IRELAND, an algorithm that allows for abstracting Boolean phrases in DNF from data with up to 10,000 samples and sample characteristics. The results show that for large datasets IRELAND outperforms the current state-of-the-art and can find solutions for datasets where current models run out of memory or need excessive runtimes.
Score: 0.6526824510982799
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data-driven decision making is rapidly gaining popularity, fueled by the ever-increasing amounts of available data and encouraged by the development of models that can identify beyond linear input-output relationships. Simultaneously, the need for interpretable prediction and classification methods is increasing, as this improves both our trust in these models and the amount of information we can abstract from data. An important aspect of this interpretability is to obtain insight into the sensitivity-specificity trade-off constituted by multiple plausible input-output relationships. These are often shown in a receiver operating characteristic (ROC) curve. These developments combined lead to the need for a method that can abstract complex yet interpretable input-output relationships from large data, i.e., data containing large numbers of samples and sample features. Boolean phrases in disjunctive normal form (DNF) are highly suitable for explaining non-linear input-output relationships in a comprehensible way. Mixed integer linear programming (MILP) can be used to abstract these Boolean phrases from binary data, though its computational complexity prohibits the analysis of large datasets. This work presents IRELAND, an algorithm that allows for abstracting Boolean phrases in DNF from data with up to 10,000 samples and sample characteristics. The results show that for large datasets IRELAND outperforms the current state-of-the-art and can find solutions for datasets where current models run out of memory or need excessive runtimes. Additionally, by construction IRELAND allows for an efficient computation of the sensitivity-specificity trade-off curve, allowing for further understanding of the underlying input-output relationship.
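
To make the abstract's terms concrete, the following minimal sketch shows how a Boolean phrase in DNF can act as a binary classifier and how its sensitivity and specificity are computed. It is not the IRELAND algorithm or its MILP formulation: the rule is written by hand, and the toy data and the names `predict_dnf` and `sensitivity_specificity` are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence, Tuple

# A literal is (feature_index, required_value). A clause is an AND of literals.
# A DNF rule is an OR of clauses. These type names are illustrative only.
Literal = Tuple[int, int]
Clause = List[Literal]


def predict_dnf(rule: List[Clause], x: Sequence[int]) -> int:
    """Return 1 if at least one clause is fully satisfied by sample x, else 0."""
    return int(any(all(x[j] == v for j, v in clause) for clause in rule))


def sensitivity_specificity(
    rule: List[Clause], X: Sequence[Sequence[int]], y: Sequence[int]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) of a DNF rule on binary labels."""
    tp = fn = tn = fp = 0
    for x, label in zip(X, y):
        pred = predict_dnf(rule, x)
        if label == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


# Hypothetical toy data: 7 samples, 3 binary features.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
y = [1, 0, 1, 0, 0, 1, 0]

# Hand-written rule: (x0 AND x1) OR x2
rule = [[(0, 1), (1, 1)], [(2, 1)]]

full = sensitivity_specificity(rule, X, y)          # both clauses
short = sensitivity_specificity([rule[0]], X, y)    # first clause only
print(full, short)
```

On this toy data, adding the second clause raises sensitivity while lowering specificity, which is the kind of trade-off an ROC curve summarizes across alternative rules.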

Related papers

SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
The Cognate Data Bottleneck in Language Phylogenetics [49.1574468325115]
Phylogenetic data analysis approaches that require larger datasets cannot be applied to cognate data. It remains an open question how, and if, these computational approaches can be applied in historical linguistics.
arXiv Detail & Related papers (2025-07-01T16:14:20Z)
An adaptive sampling algorithm for data-generation to build a data-manifold for physical problem surrogate modeling [0.7499722271664147]
We present an Adaptive Sampling Algorithm for Data Generation (ASADG) involving a physical model. We demonstrate the efficiency of the data sampling algorithm in comparison with the LHS method for generating more representative input data.
arXiv Detail & Related papers (2025-05-13T12:17:10Z)
Multivariate Temporal Regression at Scale: A Three-Pillar Framework Combining ML, XAI, and NLP [1.331812695405053]
This paper dives into the hurdles of analyzing high-dimensional data, especially when it gets too complex. Traditional methods in data analysis often look at direct connections between input variables, which can miss out on the more complicated relationships within the data. We consider the role of synthetic data and how information can sometimes be redundant across different sensors.
arXiv Detail & Related papers (2025-04-02T21:53:03Z)
Towards a Theoretical Understanding of Memorization in Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI). We provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. We propose a novel data extraction method named Surrogate condItional Data Extraction (SIDE) that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs.
arXiv Detail & Related papers (2024-10-03T13:17:06Z)
Generative Expansion of Small Datasets: An Expansive Graph Approach [13.053285552524052]
We introduce an Expansive Synthesis model generating large-scale, information-rich datasets from minimal samples. An autoencoder with self-attention layers and optimal transport refines distributional consistency. Results show comparable performance, demonstrating the model's potential to augment training data effectively.
arXiv Detail & Related papers (2024-06-25T02:59:02Z)
$\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery [1.3048920509133808]
We build a system for generation of semisynthetic manufacturing data that supports benchmarking of causal discovery methods. We employ distributional random forests to flexibly estimate and represent conditional distributions. Using the library, we showcase how to benchmark several well-known causal discovery algorithms.
arXiv Detail & Related papers (2023-06-19T10:05:54Z)
ALMERIA: Boosting pairwise molecular contrasts with scalable methods [0.0]
ALMERIA is a tool for estimating compound similarities and activity prediction based on pairwise molecular contrasts. It has been implemented using scalable software and methods to exploit large volumes of data. Experiments show state-of-the-art performance for molecular activity prediction.
arXiv Detail & Related papers (2023-04-28T16:27:06Z)
Targeted Analysis of High-Risk States Using an Oriented Variational Autoencoder [3.494548275937873]
Variational autoencoder (VAE) neural networks can be trained to generate power system states. The coordinates of the latent space codes of VAEs have been shown to correlate with conceptual features of the data. In this paper, an oriented variational autoencoder (OVAE) is proposed to constrain the link between latent space code and generated data.
arXiv Detail & Related papers (2023-03-20T19:34:21Z)
Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction [57.854498238624366]
We propose a retrieval-augmented approach, which retrieves schema-aware Reference As Prompt (RAP) for data-efficient knowledge graph construction. RAP can dynamically leverage schema and knowledge inherited from human-annotated and weak-supervised data as a prompt for each sample.
arXiv Detail & Related papers (2022-10-19T16:40:28Z)
Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis. We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
ARM-Net: Adaptive Relation Modeling Network for Structured Data [29.94433633729326]
We present ARM-Net, an adaptive relation modeling network tailored for structured data, and ARMOR, a lightweight framework based on ARM-Net for relational data. We show that ARM-Net consistently outperforms existing models and provides more interpretable predictions for datasets.
arXiv Detail & Related papers (2021-07-05T07:37:24Z)
Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts. We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data. We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
Learning summary features of time series for likelihood free inference [93.08098361687722]
We present a data-driven strategy for automatically learning summary features from time series data. Our results indicate that learning summary features from data can compete with and even outperform LFI methods based on hand-crafted values.
arXiv Detail & Related papers (2020-12-04T19:21:37Z)
Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations. Our framework preserves the relations between samples well. By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets. The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
