Online Data Collection for Efficient Semiparametric Inference
- URL: http://arxiv.org/abs/2411.03195v1
- Date: Tue, 05 Nov 2024 15:40:53 GMT
- Title: Online Data Collection for Efficient Semiparametric Inference
- Authors: Shantanu Gupta, Zachary C. Lipton, David Childers
- Abstract summary: We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps.
We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy.
- Score: 41.49486724979923
- Abstract: While many works have studied statistical data fusion, they typically assume that the various datasets are given in advance. However, in practice, estimation requires difficult data collection decisions like determining the available data sources, their costs, and how many samples to collect from each source. Moreover, this process is often sequential because the data collected at a given time can improve collection decisions in the future. In our setup, given access to multiple data sources and budget constraints, the agent must sequentially decide which data source to query to efficiently estimate a target parameter. We formalize this task using Online Moment Selection, a semiparametric framework that applies to any parameter identified by a set of moment conditions. Interestingly, the optimal budget allocation depends on the (unknown) true parameters. We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps. We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy. We empirically validate our methods on both synthetic and real-world causal effect estimation tasks, demonstrating that the online data collection policies outperform their fixed counterparts.
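The abstract's point that the optimal budget allocation depends on unknown parameters can be illustrated with a toy explore-then-commit sketch. This is not the paper's Online Moment Selection method: it assumes a much simpler problem (estimating a difference of means from two sources with equal unit costs, where the oracle allocation is the Neyman fraction sd_a / (sd_a + sd_b)), and the names `explore_then_commit`, `sample_a`, `sample_b`, and `explore_frac` are hypothetical.

```python
import random
import statistics


def explore_then_commit(sample_a, sample_b, budget, explore_frac=0.2):
    """Toy explore-then-commit: estimate theta = mean(A) - mean(B).

    Assumptions (not from the paper): two sources, unit cost per sample,
    and a total budget of at least 4 samples.
    """
    if budget < 4:
        raise ValueError("budget must be at least 4")

    # Explore: split a fixed fraction of the budget evenly across sources.
    # The cap keeps exploration within the budget for any explore_frac.
    n_explore = min(max(2, int(budget * explore_frac / 2)), budget // 2)
    xs_a = [sample_a() for _ in range(n_explore)]
    xs_b = [sample_b() for _ in range(n_explore)]

    # Plug-in estimate of the oracle fraction sd_a / (sd_a + sd_b),
    # which depends on the unknown standard deviations.
    sd_a, sd_b = statistics.stdev(xs_a), statistics.stdev(xs_b)
    frac_a = sd_a / (sd_a + sd_b) if sd_a + sd_b > 0 else 0.5

    # Commit: spend the remaining budget so the final per-source totals
    # are as close as possible to the estimated fraction.
    target_a = round(frac_a * budget)
    target_a = min(max(target_a, n_explore), budget - n_explore)
    xs_a += [sample_a() for _ in range(target_a - n_explore)]
    xs_b += [sample_b() for _ in range(budget - target_a - n_explore)]

    theta_hat = statistics.fmean(xs_a) - statistics.fmean(xs_b)
    return theta_hat, len(xs_a), len(xs_b)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sources: A is noisier than B, so it should get more samples.
    result = explore_then_commit(
        lambda: rng.gauss(1.0, 3.0), lambda: rng.gauss(0.0, 1.0), budget=1000
    )
    print(result)
```

In this sketch the allocation is fixed once after exploration; a greedy variant would instead re-estimate the fraction after every new sample.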
Related papers
- Compute-Constrained Data Selection [77.06528009072967]
We formalize the problem of data selection with a cost-aware utility function, and model the problem as trading off initial-selection cost for training gain.
We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute.
arXiv Detail & Related papers (2024-10-21T17:11:21Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
arXiv Detail & Related papers (2024-01-10T14:40:23Z) - Reinforced Approximate Exploratory Data Analysis [7.974685452145769]
We are the first to consider the impact of sampling in interactive data exploration settings, where it introduces approximation errors.
We propose a Deep Reinforcement Learning (DRL) based framework which can optimize the sample selection in order to keep the analysis and insight generation flow intact.
arXiv Detail & Related papers (2022-12-12T20:20:22Z) - Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm for modeling the data collection as a formal optimal data collection problem.
arXiv Detail & Related papers (2022-10-03T21:19:05Z) - Sales Channel Optimization via Simulations Based on Observational Data with Delayed Rewards: A Case Study at LinkedIn [4.6405223560607105]
Training models on data obtained from randomized experiments is ideal for making good decisions.
However, randomized experiments are often time-consuming, expensive, risky, infeasible or unethical to perform.
We build a discrete-time simulation that can handle our problem features and use it to evaluate different policies.
Our simulation results indicate that LinUCB, a simple MAB policy, consistently outperforms the other policies.
arXiv Detail & Related papers (2022-09-16T07:08:37Z) - Active Sampling of Multiple Sources for Sequential Estimation [92.37271004438406]
The objective is to design an active sampling algorithm for sequentially estimating parameters in order to form reliable estimates.
This paper adopts conditional estimation cost functions, leading to a sequential estimation approach that was recently shown to render tractable analysis.
arXiv Detail & Related papers (2022-08-10T15:58:05Z) - Efficient Online Estimation of Causal Effects by Deciding What to Observe [26.222870185443913]
We aim to estimate any functional of a probabilistic model (e.g., a causal effect) as efficiently as possible, by deciding, at each time, which data source to query.
We propose online moment selection (OMS), a framework in which structural assumptions are encoded as moment conditions.
Our algorithms balance exploration with choosing the best action as suggested by current estimates of the moments.
arXiv Detail & Related papers (2021-08-20T17:00:56Z) - Adaptive Sequential Design for a Single Time-Series [2.578242050187029]
We learn an optimal, unknown choice of the controlled components of a design in order to optimize the expected outcome.
We adapt the randomization mechanism for future time-point experiments based on the data collected on the individual over time.
arXiv Detail & Related papers (2021-01-29T22:51:45Z)