Online Data Collection for Efficient Semiparametric Inference
- URL: http://arxiv.org/abs/2411.03195v1
- Date: Tue, 05 Nov 2024 15:40:53 GMT
- Title: Online Data Collection for Efficient Semiparametric Inference
- Authors: Shantanu Gupta, Zachary C. Lipton, David Childers
- Abstract summary: We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps.
We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy.
- Score: 41.49486724979923
- Abstract: While many works have studied statistical data fusion, they typically assume that the various datasets are given in advance. However, in practice, estimation requires difficult data collection decisions like determining the available data sources, their costs, and how many samples to collect from each source. Moreover, this process is often sequential because the data collected at a given time can improve collection decisions in the future. In our setup, given access to multiple data sources and budget constraints, the agent must sequentially decide which data source to query to efficiently estimate a target parameter. We formalize this task using Online Moment Selection, a semiparametric framework that applies to any parameter identified by a set of moment conditions. Interestingly, the optimal budget allocation depends on the (unknown) true parameters. We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps. We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy. We empirically validate our methods on both synthetic and real-world causal effect estimation tasks, demonstrating that the online data collection policies outperform their fixed counterparts.
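The explore-then-commit idea described in the abstract can be illustrated with a toy sketch. This is not the paper's Online Moment Selection estimator: it assumes a much simpler target, the sum of two means, where each hypothetical source reports one component at unit cost. In that setting the oracle allocation queries each source in proportion to its (unknown) standard deviation, so the policy explores uniformly, plugs in estimated standard deviations, and commits the remaining budget. The function name `explore_then_commit` and its parameters are illustrative assumptions.

```python
import random
import statistics


def explore_then_commit(sources, budget, explore_frac=0.2, seed=0):
    """Toy explore-then-commit allocation for estimating a sum of means.

    sources: list of callables rng -> float, one per data source (hypothetical).
    Returns (estimate, counts), where counts[i] is the number of queries
    made to source i. Assumes unit cost per query and 0 < explore_frac <= 1.
    """
    k = len(sources)
    if budget < 2 * k:
        raise ValueError("budget must allow at least two samples per source")
    rng = random.Random(seed)
    samples = [[] for _ in range(k)]

    # Explore: split a fixed fraction of the budget evenly across sources.
    n_explore = max(2, int(explore_frac * budget / k))
    for i, draw in enumerate(sources):
        samples[i].extend(draw(rng) for _ in range(n_explore))

    # Plug-in oracle: the variance of the summed sample means is minimized
    # when n_i is proportional to sigma_i (unit costs assumed).
    sigmas = [statistics.stdev(s) for s in samples]
    total_sigma = sum(sigmas) or 1.0
    targets = [budget * s / total_sigma for s in sigmas]

    # Commit: spend the remaining budget moving counts toward the targets.
    remaining = budget - k * n_explore
    deficits = [max(0.0, t - n_explore) for t in targets]
    total_deficit = sum(deficits) or 1.0
    extra = [int(remaining * d / total_deficit) for d in deficits]
    # Give rounding leftovers to the noisiest source.
    extra[sigmas.index(max(sigmas))] += remaining - sum(extra)
    for i, draw in enumerate(sources):
        samples[i].extend(draw(rng) for _ in range(extra[i]))

    estimate = sum(statistics.fmean(s) for s in samples)
    return estimate, [len(s) for s in samples]
```

With one low-noise and one high-noise source, most of the post-exploration budget should go to the noisier source, which is the behavior a fixed uniform allocation cannot reproduce.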
Related papers
- Model-Free Counterfactual Subset Selection at Scale [11.646993755965006]
Streaming explanations offer adaptive, real-time insights without requiring persistent storage of the entire dataset.
Our algorithm operates efficiently in streaming settings, maintaining $O(\log k)$ update complexity per item.
Empirical evaluations on both real-world and synthetic datasets demonstrate superior performance over baseline methods.
arXiv Detail & Related papers (2025-02-12T11:48:15Z) - Unifying and Optimizing Data Values for Selection via Sequential-Decision-Making [5.755427480127593]
We show that data values applied for selection can be reformulated as a sequential-decision-making problem.
We propose an efficient approximation scheme using learned bipartite graphs as surrogate utility models.
arXiv Detail & Related papers (2025-02-06T23:03:10Z) - Compute-Constrained Data Selection [77.06528009072967]
We find that many powerful data selection methods are almost never compute-optimal.
For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
arXiv Detail & Related papers (2024-10-21T17:11:21Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
arXiv Detail & Related papers (2024-01-10T14:40:23Z) - Optimizing Data Collection for Machine Learning [87.37252958806856]
Modern deep learning systems require huge data sets to achieve impressive performance.
Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay.
We propose a new paradigm for modeling the data collection as a formal optimal data collection problem.
arXiv Detail & Related papers (2022-10-03T21:19:05Z) - Sales Channel Optimization via Simulations Based on Observational Data with Delayed Rewards: A Case Study at LinkedIn [4.6405223560607105]
Training models on data obtained from randomized experiments is ideal for making good decisions.
However, randomized experiments are often time-consuming, expensive, risky, infeasible or unethical to perform.
We build a discrete-time simulation that can handle our problem features and use it to evaluate different policies.
Our simulation results indicate that LinUCB, a simple MAB policy, consistently outperforms the other policies.
arXiv Detail & Related papers (2022-09-16T07:08:37Z) - Active Sampling of Multiple Sources for Sequential Estimation [92.37271004438406]
The objective is to design an active sampling algorithm for sequentially estimating parameters in order to form reliable estimates.
This paper adopts conditional estimation cost functions, leading to a sequential estimation approach that was recently shown to render tractable analysis.
arXiv Detail & Related papers (2022-08-10T15:58:05Z) - Efficient Online Estimation of Causal Effects by Deciding What to Observe [26.222870185443913]
We aim to estimate any functional of a probabilistic model (e.g., a causal effect) as efficiently as possible, by deciding, at each time, which data source to query.
We propose online moment selection (OMS), a framework in which structural assumptions are encoded as moment conditions.
Our algorithms balance exploration with choosing the best action as suggested by current estimates of the moments.
arXiv Detail & Related papers (2021-08-20T17:00:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.