Reinforced Approximate Exploratory Data Analysis
- URL: http://arxiv.org/abs/2212.06225v1
- Date: Mon, 12 Dec 2022 20:20:22 GMT
- Title: Reinforced Approximate Exploratory Data Analysis
- Authors: Shaddy Garg, Subrata Mitra, Tong Yu, Yash Gadhia, Arjun Kashettiwar
- Abstract summary: We are the first to consider the impact of sampling in interactive data exploration settings, as it introduces approximation errors.
We propose a Deep Reinforcement Learning (DRL) based framework which can optimize the sample selection in order to keep the analysis and insight generation flow intact.
- Score: 7.974685452145769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploratory data analytics (EDA) is a sequential decision making process
where analysts choose subsequent queries that might lead to some interesting
insights based on the previous queries and corresponding results. Data
processing systems often execute the queries on samples to produce results with
low latency. Different downsampling strategies preserve different statistics of
the data and yield different magnitudes of latency reduction. The optimum choice
of sampling strategy often depends on the particular context of the analysis
flow and the hidden intent of the analyst. In this paper, we are the first to
consider the impact of sampling in interactive data exploration settings, where
it introduces approximation errors. We propose a Deep Reinforcement Learning
(DRL) based framework which can optimize the sample selection in order to keep
the analysis and insight generation flow intact. Evaluations with 3 real
datasets show that our technique can preserve the original insight generation
flow while improving the interaction latency, compared to baseline methods.
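As a rough illustration of the decision problem the DRL framework tackles, the sketch below casts sampling-strategy selection as a contextual bandit learned with tabular updates. Every name here (the query types, strategies, preservation probabilities, and reward shape) is invented for illustration; the paper's actual method is a deep RL model, not this toy.
```python
import random
from collections import defaultdict

# Hypothetical setup: states are coarse query types, actions are sampling
# strategies, and the reward trades insight preservation against latency.
QUERY_TYPES = ["groupby", "filter", "aggregate"]
STRATEGIES = ["uniform", "stratified", "reservoir"]

# Simulated environment: per (query, strategy) probability that the sampled
# result preserves the full-data insight, and a relative latency cost.
PRESERVE_PROB = {("groupby", "stratified"): 0.9, ("filter", "uniform"): 0.8}
LATENCY = {"uniform": 0.1, "stratified": 0.3, "reservoir": 0.2}

def reward(query, strategy):
    # +1 if the approximate result preserves the insight, -1 otherwise,
    # minus a latency penalty for slower sampling strategies.
    p = PRESERVE_PROB.get((query, strategy), 0.5)
    preserved = random.random() < p
    return (1.0 if preserved else -1.0) - LATENCY[strategy]

Q = defaultdict(float)       # Q[(query_type, strategy)]
alpha, epsilon = 0.1, 0.2    # learning rate, exploration rate

for _ in range(5000):
    query = random.choice(QUERY_TYPES)        # analyst issues the next query
    if random.random() < epsilon:             # epsilon-greedy strategy choice
        strategy = random.choice(STRATEGIES)
    else:
        strategy = max(STRATEGIES, key=lambda s: Q[(query, s)])
    r = reward(query, strategy)
    Q[(query, strategy)] += alpha * (r - Q[(query, strategy)])  # bandit update

for q in QUERY_TYPES:
    print(q, "->", max(STRATEGIES, key=lambda s: Q[(q, s)]))
```
The learned table ends up preferring, per query type, whichever strategy best balances insight preservation against latency, which is the trade-off the paper optimizes with a deep model over richer state.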
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
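A minimal sketch of the multimodal scoring idea behind the CLIP-powered framework, assuming Hugging Face's transformers CLIP classes; the top-k selection rule is our own illustrative stand-in for the paper's selection criteria:
```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative only: score each (image, caption) pair by CLIP image-text
# cosine similarity and keep the highest-scoring fraction.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_scores(images, captions):
    # images: list of PIL images; captions: list of matching strings.
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)           # per-pair cosine similarity

def select_samples(images, captions, keep_ratio=0.5):
    scores = clip_alignment_scores(images, captions)
    k = max(1, int(len(images) * keep_ratio))
    return scores.topk(k).indices.tolist()   # indices of best-aligned pairs
```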
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- RECOST: External Knowledge Guided Data-efficient Instruction Tuning [25.985023475991625]
We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset.
We propose a framework dubbed RECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
arXiv Detail & Related papers (2024-02-27T09:47:36Z)
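A loose sketch of the two stages RECOST combines, with synthetic embeddings standing in for the instruction data and external knowledge base; the similarity scoring and diversity threshold are assumptions, not the paper's actual pipeline:
```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 64))      # embedded instruction samples
knowledge = rng.normal(size=(50, 64))     # embedded external knowledge

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

S, K = normalize(samples), normalize(knowledge)

# Stage 1: re-rank by external-knowledge support, here the max cosine
# similarity of each sample to any knowledge-base entry.
ranked = np.argsort(-(S @ K.T).max(axis=1))

# Stage 2: diversity-consistent sampling - greedily keep the highest-ranked
# sample that is not too similar to anything already chosen.
chosen, threshold, budget = [], 0.8, 20
for i in ranked:
    if all(S[i] @ S[j] < threshold for j in chosen):
        chosen.append(i)
    if len(chosen) == budget:
        break
print(f"selected {len(chosen)} of {len(samples)} samples")
```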
- TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
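A hedged sketch of conformal-style data scoring for regression (not the actual TRIAGE score): rank each training point by how extreme its residual is relative to held-out calibration residuals, then flag the most extreme points for auditing:
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.normal(size=500)
y[:25] += 5.0                                  # inject some noisy labels

X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_fit, y_fit)

cal_resid = np.abs(y_cal - model.predict(X_cal))   # calibration residuals
train_resid = np.abs(y - model.predict(X))
# Conformal-style score: fraction of calibration residuals a point exceeds;
# values near 1 suggest hard-to-fit or mislabeled examples worth auditing.
score = (train_resid[:, None] > cal_resid[None, :]).mean(axis=1)
suspect = np.argsort(-score)[:25]
print("flagged indices:", sorted(suspect))
```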
- Optimal Sample Selection Through Uncertainty Estimation and Its Application in Deep Learning [22.410220040736235]
We present a theoretically optimal solution for addressing both coreset selection and active learning.
Our proposed method, COPS, is designed to minimize the expected loss of a model trained on subsampled data.
arXiv Detail & Related papers (2023-09-05T14:06:33Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when an unlabeled sample is believed to incur a high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
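A rough sketch of the loss-estimation idea, using disagreement between two training checkpoints as a stand-in for the paper's temporal output discrepancy measure; the data, model, and query budget are invented:
```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(100, 10))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unl = rng.normal(size=(1000, 10))

# Same initialization, different numbers of optimization steps, so the two
# models approximate early and late checkpoints of one training run.
early = MLPClassifier(max_iter=20, random_state=0).fit(X_lab, y_lab)
late = MLPClassifier(max_iter=200, random_state=0).fit(X_lab, y_lab)

# Discrepancy between checkpoint outputs as a proxy for per-sample loss.
disc = np.linalg.norm(
    late.predict_proba(X_unl) - early.predict_proba(X_unl), axis=1)
query = np.argsort(-disc)[:10]   # send the highest-discrepancy points to the oracle
print("query indices:", query)
```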
- Stream-based active learning with linear models [0.7734726150561089]
In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data.
We propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner.
The iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points.
arXiv Detail & Related papers (2022-07-20T13:15:23Z)
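A minimal sketch of stream-based selection with a linear model, assuming predictive variance under recursive least squares as the informativeness measure; the threshold and the hidden true model are illustrative, not the paper's exact strategy:
```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, threshold = 5, 1.0, 0.5
A = lam * np.eye(d)          # regularized information matrix
b = np.zeros(d)
labeled = 0

for t in range(2000):
    x = rng.normal(size=d)                  # next instance from the stream
    info = x @ np.linalg.inv(A) @ x         # informativeness: leverage/variance
    if info > threshold:                    # query the inspector for a label
        y = x @ np.ones(d) + 0.1 * rng.normal()   # hidden true model
        A += np.outer(x, x)
        b += y * x
        labeled += 1

w = np.linalg.solve(A, b)                   # fitted linear model
print(f"labeled {labeled} of 2000 streamed points")
```
As points accumulate, the predictive variance in well-covered directions shrinks below the threshold, so the learner stops querying redundant instances, which is the labeling-budget saving the entry describes.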
- Statistical Inference After Adaptive Sampling for Longitudinal Data [9.468593929311867]
We develop novel methods to perform a variety of statistical analyses on adaptively sampled data via Z-estimation.
We develop novel theoretical tools for empirical processes on non-i.i.d., adaptively sampled longitudinal data which may be of independent interest.
arXiv Detail & Related papers (2022-02-14T23:48:13Z)
- On Sampling Collaborative Filtering Datasets [9.041133460836361]
We study the practical consequences of dataset sampling strategies on the ranking performance of recommendation algorithms.
We develop an oracle, Data-Genie, which can suggest the sampling scheme that is most likely to preserve model performance for a given dataset.
arXiv Detail & Related papers (2022-01-13T02:39:22Z)
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
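A toy sketch of the idea, assuming a made-up quadratic objective: fit a Gaussian process to local noisy evaluations and step along the gradient of the surrogate's posterior mean rather than a raw perturbation estimate:
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def objective(theta):              # stand-in for a noisy policy return
    return -np.sum((theta - 1.0) ** 2) + 0.1 * rng.normal()

theta = np.zeros(3)
for _ in range(30):
    X = theta + 0.2 * rng.normal(size=(15, 3))   # local perturbations
    y = np.array([objective(x) for x in X])
    gp = GaussianProcessRegressor(kernel=RBF(0.5), alpha=0.01).fit(X, y)
    # Central finite differences on the GP posterior mean.
    eps, grad = 1e-3, np.zeros(3)
    for i in range(3):
        e = np.zeros(3)
        e[i] = eps
        grad[i] = (gp.predict((theta + e).reshape(1, -1))[0]
                   - gp.predict((theta - e).reshape(1, -1))[0]) / (2 * eps)
    theta += 0.1 * grad            # ascend the modeled objective
print("theta after local search:", np.round(theta, 2))
```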
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)