Fugu-MT Paper Translation (Abstract): Testing with Non-identically Distributed Samples

Paper overview: Testing with Non-identically Distributed Samples

arxiv url: http://arxiv.org/abs/2311.11194v1
Date: Sun, 19 Nov 2023 01:25:50 GMT
Status: Translation complete
Last updated in system: 2023-11-22 06:55:55.151796
Title: Testing with Non-identically Distributed Samples
Title (reference translation): Testing with Non-identically Distributed Samples
Authors: Shivam Garg, Chirag Pabbaraju, Kirankumar Shiragur, Gregory Valiant
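Abstract summary: This work examines the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Even with $c=1$ sample from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance.
Reference score (independently computed attention score): 20.74768558932617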
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing.
Abstract (reference translation): We examine the extent to which sublinear-sample property testing and estimation apply in settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: suppose there is a set of distributions $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$ over a discrete support of size $k$, and we obtain $c$ independent draws from each distribution. The goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear-sample testing of the i.i.d. setting: $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity of the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, in the $c=2$ case, there is a constant $\rho > 0$ such that, even in the linear regime with $\rho k$ samples, no tester that considers only the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing.
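
The setup in the abstract can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's algorithm: all function names are hypothetical, the heterogeneous distributions are generated from random weights purely for illustration, and the collision-style statistic is one simple unbiased estimator of $\|\textbf{p}_{\mathrm{avg}}\|_2^2$ that happens to need $c=2$ grouped draws. It relies on the identity $T^2\|\textbf{p}_{\mathrm{avg}}\|_2^2 = \sum_i \|\textbf{p}_i\|_2^2 + \sum_{i \ne j} \langle \textbf{p}_i, \textbf{p}_j \rangle$, where the first sum is estimated from collisions within a pair and the second from collisions across different sources.

```python
import random
from collections import Counter


def make_distributions(T, k, rng):
    """Hypothetical heterogeneous p_1..p_T over {0..k-1} (random weights, normalized)."""
    dists = []
    for _ in range(T):
        weights = [rng.random() for _ in range(k)]
        total = sum(weights)
        dists.append([w / total for w in weights])
    return dists


def average_distribution(dists):
    """p_avg: the coordinate-wise mean of the T distributions."""
    T, k = len(dists), len(dists[0])
    return [sum(p[x] for p in dists) / T for x in range(k)]


def draw_grouped(dists, c, rng):
    """c independent draws from each p_i; samples[i] holds the draws from p_i."""
    k = len(dists[0])
    return [rng.choices(range(k), weights=p, k=c) for p in dists]


def pooled_empirical(samples, k):
    """Empirical distribution of all draws pooled together (an estimate of p_avg)."""
    counts = Counter(s for group in samples for s in group)
    n = sum(counts.values())
    return [counts[x] / n for x in range(k)]


def tv_distance(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def paired_l2_norm_sq(samples):
    """Unbiased estimate of ||p_avg||_2^2 from grouped draws with c >= 2.

    Illustrative assumption, not the paper's tester: only the first two draws
    of each group are used.
    """
    T = len(samples)
    # Collisions inside a pair estimate sum_i ||p_i||^2.
    within = sum(1 for g in samples if g[0] == g[1])
    # Ordered collisions between first draws of different sources
    # estimate sum_{i != j} <p_i, p_j>.
    first_counts = Counter(g[0] for g in samples)
    cross = sum(n * (n - 1) for n in first_counts.values())
    return (within + cross) / T**2


if __name__ == "__main__":
    rng = random.Random(0)
    T, k = 2000, 20
    dists = make_distributions(T, k, rng)
    p_avg = average_distribution(dists)
    samples = draw_grouped(dists, 2, rng)
    print("TV(pooled empirical, p_avg):", tv_distance(pooled_empirical(samples, k), p_avg))
    # ||p_avg - uniform||_2^2 = ||p_avg||_2^2 - 1/k
    print("estimated squared l2 distance to uniform:", paired_l2_norm_sq(samples) - 1 / k)
```

The within-pair term is the part that depends on knowing which draws share a source, which gives some intuition for why the abstract distinguishes $c=1$ from $c \ge 2$ and why the grouping information matters. For example, if each $\textbf{p}_i$ is a point mass on its own symbol and $T = k$, then $\textbf{p}_{\mathrm{avg}}$ is uniform and this estimator returns exactly $1/k$.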

Related papers

Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models [65.71506381302815]
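This paper amortizes the cost of sampling from posterior distributions of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})$. For many models and constraints of interest, the posterior in noise space is smoother than the posterior in data space, making it more amenable to such amortized inference.
Paper reference translation (metadata) (2025-02-10T19:49:54Z)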
On the query complexity of sampling from non-log-concave distributions [2.4253233571593547]
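We study the problem of sampling from a $d$-dimensional distribution with density $p(x) \propto e^{-f(x)}$, which does not necessarily satisfy good isoperimetric conditions. We show that, for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in $d$.
Paper reference translation (metadata) (2025-02-10T06:54:16Z)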
Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
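Previous private estimators for distributions over $\mathbb{R}^d$ suffer from the curse of dimensionality. This paper presents an algorithm whose sample complexity has an improved dependence on the dimension.
Paper reference translation (metadata) (2024-11-01T17:59:53Z)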
Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
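We show that, for a broad class of data structures, their bounds cannot be significantly improved. This is a new statistical-computational trade-off for density estimation.
Paper reference translation (metadata) (2024-10-30T15:03:33Z)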
Testing Identity of Distributions under Kolmogorov Distance in Polylogarithmic Space [1.2277343096128712]
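This paper provides an algorithm for the streaming setting that uses $O(\log^4 \varepsilon^{-1})$ space. We also state nine related open problems, hoping to spark interest in this and related problems.
Paper reference translation (metadata) (2024-10-29T15:24:27Z)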
Outlier Robust Multivariate Polynomial Regression [27.03423421704806]
The samples in $[-1,1]^n \times \mathbb{R}$ are noisy versions of $(\mathbf{x}_i, p(\mathbf{x}_i))$. The goal is to output a $\hat{p}$ within $\ell_\infty$-distance $O(\sigma)$ of $p$.
Paper reference translation (metadata) (2024-03-14T15:04:45Z)
Efficient Sampling of Stochastic Differential Equations with Positive Semi-Definite Models [91.22420505636006]
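This paper addresses the problem of efficient sampling from a stochastic differential equation, given the drift function and the diffusion matrix. The stated dependence on $1/\varepsilon$ is $m^2 d \log(1/\varepsilon)$. The results suggest that, as the true solution becomes smoother, the curse of dimensionality can be circumvented without requiring any form of convexity.
Paper reference translation (metadata) (2023-03-30T02:50:49Z)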
Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions [8.40077201352607]
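We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation. Our algorithm produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac{d}{\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha \varepsilon} + \tfrac{d \log 1/\delta}{\varepsilon}$.
Paper reference translation (metadata) (2023-01-28T16:57:46Z)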
Near-Optimal Bounds for Testing Histogram Distributions [35.18069719489173]
We show that the histogram testing problem has sample complexity $\widetilde{\Theta}(\sqrt{nk}/\varepsilon + k/\varepsilon^2 + \sqrt{n}/\varepsilon^2)$.
Paper reference translation (metadata) (2022-07-14T01:24:01Z)
Independence Testing for Bounded Degree Bayesian Network [4.230271396864461]
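We show that if $P$ has a sparse structure, then in fact only linearly many samples are required. We also show that if $P$ is Markov with respect to a Bayesian network whose underlying DAG has degree bounded by $d$, then $\tilde{\Theta}(2^{d/2} \cdot n/\varepsilon^2)$ samples are needed.
Paper reference translation (metadata) (2022-04-19T06:16:14Z)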
Tight Bounds on the Hardness of Learning Simple Nonparametric Mixtures [9.053430799456587]
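We study the problem of learning nonparametric distributions in finite mixtures. We establish tight bounds on the sample complexity of learning the component distributions in such models.
Paper reference translation (metadata) (2022-03-28T23:53:48Z)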
The Sample Complexity of Robust Covariance Testing [56.98280399449707]
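We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance. In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples. We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$ [...].
Paper reference translation (metadata) (2020-12-31T18:24:41Z)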
Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
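We study the problem of testing discrete distributions with a focus on the high-probability regime. We provide the first algorithms for closeness and independence testing that are sample-optimal to within constant factors.
Paper reference translation (metadata) (2020-09-14T16:09:17Z)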
Locally Private Hypothesis Selection [96.06118559817057]
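We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to that of the best such distribution. We show that the constraint of local differential privacy causes a sharp increase in cost. The proposed algorithm exponentially improves on the round complexity of prior methods.
Paper reference translation (metadata) (2020-02-21T18:30:48Z)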
Sample Amplification: Increasing Dataset Size even when Learning is Impossible [15.864702679819544]
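Given data drawn from an unknown distribution $D$, to what extent is it possible to amplify this dataset and output an even larger set of samples that appear to have been drawn from $D$? The question is formalized as follows: an $\left(n, n + \Theta(\frac{n}{\sqrt{k}})\right)$ amplifier exists, even though learning the distribution to small constant total variation distance requires $\Theta(d)$ samples.
Paper reference translation (metadata) (2019-04-26T21:42:44Z)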

The related papers list is generated automatically from the titles and abstracts of papers on this site.

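The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).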