Statistical Estimation from Dependent Data
- URL: http://arxiv.org/abs/2107.09773v1
- Date: Tue, 20 Jul 2021 21:18:06 GMT
- Title: Statistical Estimation from Dependent Data
- Authors: Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Surbhi Goel,
Anthimos Vardis Kandiros
- Abstract summary: We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors.
We model these dependencies in the language of Markov Random Fields.
We provide algorithms and statistically efficient estimation rates for this model.
- Score: 37.73584699735133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a general statistical estimation problem wherein binary labels
across different observations are not independent conditioned on their feature
vectors, but dependent, capturing settings where, e.g., these observations are
collected on a spatial domain, a temporal domain, or a social network, each of
which induces dependencies. We model these dependencies in the language of
Markov Random Fields and, importantly, allow them to be substantial, i.e., we
do not assume that the Markov Random Field capturing these dependencies is in
the high-temperature regime. As our main contribution, we provide algorithms and
statistically efficient estimation rates for this model, giving several
instantiations of our bounds in logistic regression, sparse logistic
regression, and neural network settings with dependent data. Our estimation
guarantees follow from novel results for estimating the parameters (i.e.
external fields and interaction strengths) of Ising models from a single
sample. We evaluate our estimation approach on real networked data, showing
that it outperforms standard regression approaches that ignore dependencies,
across three text classification datasets: Cora, Citeseer, and Pubmed.
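The guarantees above rest on estimating Ising-model parameters from one sample. A common route to such single-sample estimation is maximum pseudo-likelihood: maximize the product over i of P(y_i | y_-i, X). Below is a minimal sketch in that spirit, with labels y_i in {-1, +1}, external field w.x_i, a known dependency graph A, and a scalar interaction strength beta; all function and variable names are illustrative, and this is a simplified stand-in rather than the paper's exact estimator or analysis.
```python
# Hedged sketch: maximum pseudo-likelihood estimation (MPLE) for a
# logistic-regression-style Ising model. Labels y_i in {-1, +1}, external
# field w . x_i, known adjacency A, scalar interaction strength beta.
import numpy as np
from scipy.optimize import minimize

def neg_pseudo_loglik(params, X, y, A):
    """Negative log pseudo-likelihood: each label conditioned on the rest.

    P(y_i = s | y_-i) = sigmoid(2 s (x_i . w + beta * (A y)_i))
    """
    w, beta = params[:-1], params[-1]
    local_field = X @ w + beta * (A @ y)   # external field + neighbor effect
    margins = 2.0 * y * local_field
    # -log sigmoid(m) = log(1 + exp(-m)), computed stably
    return np.sum(np.logaddexp(0.0, -margins))

def fit_mple(X, y, A):
    """Jointly estimate regression weights w and interaction strength beta."""
    d = X.shape[1]
    x0 = np.zeros(d + 1)
    res = minimize(neg_pseudo_loglik, x0, args=(X, y, A), method="L-BFGS-B")
    return res.x[:-1], res.x[-1]           # (w_hat, beta_hat)

# Toy usage on a small random graph with labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
A = (rng.random((n, n)) < 0.02).astype(float)
A = np.triu(A, 1); A = A + A.T             # symmetric adjacency, zero diagonal
y = np.sign(rng.normal(size=n)); y[y == 0] = 1.0
w_hat, beta_hat = fit_mple(X, y, A)
```
The objective is convex in (w, beta) jointly, since the local field is linear in the parameters, so a generic quasi-Newton solver suffices for this toy setting.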
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Inference at the data's edge: Gaussian processes for modeling and inference under model-dependency, poor overlap, and extrapolation [0.0]
The Gaussian Process (GP) is a flexible non-linear regression approach.
It provides a principled way to quantify uncertainty over predicted (counterfactual) values.
This is especially valuable under conditions of extrapolation or weak overlap.
arXiv Detail & Related papers (2024-07-15T05:09:50Z)
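To make the GP entry's "principled uncertainty" concrete, here is a minimal sketch of exact GP regression with an assumed RBF kernel and fixed hyperparameters; it is textbook GP posterior algebra, not that paper's specific model, and a real application would fit the hyperparameters (e.g., by marginal likelihood).
```python
# Hedged sketch: exact Gaussian Process regression posterior with an RBF
# kernel. The posterior variance widens far from the training data,
# flagging exactly the extrapolation / weak-overlap regions at issue.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=0.1):
    """Posterior mean and variance at X_test under a zero-mean GP prior."""
    K = rbf(X_train, X_train) + noise**2 * np.eye(len(X_train))
    Ks = rbf(X_train, X_test)
    Kss = rbf(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)
```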
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Statistical inference of travelers' route choice preferences with system-level data [4.120057972557892]
We develop a methodology to estimate travelers' utility functions with multiple attributes using system-level data.
Experiments on synthetic data show that the coefficients are consistently recovered and that hypothesis tests reliably identify which attributes are determinants of travelers' route choices.
The methodology is also deployed at a large scale using real Fresnoworld multisource data collected during the COVID-19 outbreak.
arXiv Detail & Related papers (2022-04-23T00:38:32Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
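For concreteness, a minimal sketch of the simplest member of the resampling family benchmarked above, random oversampling of the minority class; this helper is illustrative, not from that paper, and real pipelines typically use a library such as imbalanced-learn (SMOTE, undersampling variants, etc.).
```python
# Hedged sketch: random oversampling of the minority class until both
# classes are equal in size. Assumes binary labels.
import numpy as np

def random_oversample(X, y, rng=None):
    """Duplicate randomly chosen minority-class rows to balance classes.

    X is (n, d); y is (n,) with two distinct label values.
    """
    rng = rng or np.random.default_rng()
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```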
- Learning Invariant Representations with Missing Data [18.307438471163774]
Models that satisfy particular independencies involving correlation-inducing nuisance variables have guarantees on their test performance.
We derive MMD estimators used for invariance objectives under missing nuisances.
On simulations and clinical data, optimizing through these estimates achieves test performance similar to using estimators that make use of the full data.
arXiv Detail & Related papers (2021-12-01T23:14:34Z)
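Since MMD is a concrete, estimable quantity, here is the standard unbiased estimator of squared MMD between two samples with an assumed RBF kernel; this is the textbook estimator such invariance objectives penalize between groups, not that paper's missing-data-adjusted version.
```python
# Hedged sketch: unbiased estimator of squared Maximum Mean Discrepancy
# (MMD^2) between samples X (n, d) and Y (m, d) under an RBF kernel.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def mmd2_unbiased(X, Y, lengthscale=1.0):
    Kxx = rbf(X, X, lengthscale)
    Kyy = rbf(Y, Y, lengthscale)
    Kxy = rbf(X, Y, lengthscale)
    n, m = len(X), len(Y)
    # Drop diagonal terms for the unbiased within-sample averages.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```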
- Aggregation as Unsupervised Learning and its Evaluation [9.109147994991229]
We present an empirical evaluation framework that allows assessing the proposed approach against other aggregation approaches.
We use regression data sets from the UCI machine learning repository and benchmark several data-agnostic and unsupervised approaches for aggregation.
The benchmark results indicate that our approach outperforms the other data-agnostic and unsupervised aggregation approaches.
arXiv Detail & Related papers (2021-10-28T14:10:30Z)
- Evaluating Model Robustness and Stability to Dataset Shift [7.369475193451259]
We propose a framework for analyzing the stability of machine learning models.
We use the original evaluation data to determine distributions under which the algorithm performs poorly.
We estimate the algorithm's performance on the "worst-case" distribution.
arXiv Detail & Related papers (2020-10-28T17:35:39Z)
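One crude instance of "performance on the worst-case distribution" is the conditional value-at-risk of per-example losses, i.e., the average loss over the hardest alpha-fraction of the evaluation set; the sketch below shows that simple stand-in, while the stability framework above uses a more structured shift model.
```python
# Hedged sketch: CVaR-style worst-case evaluation over the hardest
# alpha-fraction of evaluation examples (illustrative, not that paper's
# exact shift model).
import numpy as np

def worst_case_loss(losses, alpha=0.1):
    """Mean loss over the worst alpha-fraction of evaluation examples."""
    k = max(1, int(np.ceil(alpha * len(losses))))
    worst = np.sort(np.asarray(losses))[-k:]   # k largest losses
    return worst.mean()
```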
- On Disentangled Representations Learned From Correlated Data [59.41587388303554]
We bridge the gap to real-world scenarios by analyzing the behavior of the most prominent disentanglement approaches on correlated data.
We show that systematically induced correlations in the dataset are being learned and reflected in the latent representations.
We also demonstrate how to resolve these latent correlations, either using weak supervision during training or by post-hoc correcting a pre-trained model with a small number of labels.
arXiv Detail & Related papers (2020-06-14T12:47:34Z)
- TraDE: Transformers for Density Estimation [101.20137732920718]
TraDE is a self-attention-based architecture for auto-regressive density estimation.
We present a suite of tasks such as regression using generated samples, out-of-distribution detection, and robustness to noise in the training data.
arXiv Detail & Related papers (2020-04-06T07:32:51Z)