Fugu-MT paper translation (abstract): Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

Paper abstract: Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

arxiv url: http://arxiv.org/abs/2605.06343v1
Date: Thu, 07 May 2026 14:29:31 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.902561
Title: Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models
Authors: Alex O. Davies, Telmo de Menezes e Silva Filho, Nirav Ajmeri,
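Abstract summary: Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. In this work we take three canonical, archetypal datasets used to train tabular foundation models. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, and that this mismatch cannot be closed by optimising the prior's hyper-parameters.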
Reference score (independently computed attention score): 7.124188498356204
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, or about the impact this has on downstream performance. In this work we take three canonical, archetypal datasets used to train tabular foundation models: the T4 dataset represents web-scraped corpora, the TabFM dataset represents curated tables from Kaggle, and the TabICL dataset is the only well-used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has no clearly detectable effect on performance under either feature-based proximity measures or TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.
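The two comparison tools named in the abstract can be illustrated with a small sketch. The abstract does not say which discriminator, which features, or which coverage definition the authors use, so everything below is an assumption: a leave-one-out k-NN classifier stands in for the discriminator, coverage is taken to be the fraction of real points whose k-nearest-neighbour ball contains at least one synthetic point, and toy 2-D Gaussian points stand in for per-table feature vectors. All function and variable names are hypothetical.

```python
import math
import random

def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_pos) * len(scores_neg))

def discriminator_auc(real, synth, k=5):
    """AUC of a leave-one-out k-NN discriminator (assumed stand-in classifier).

    Each point is scored by the fraction of its k nearest neighbours that are real.
    An AUC near 0.5 means the two sets are hard to tell apart; near 1.0 means easy.
    """
    data = [(x, 1) for x in real] + [(x, 0) for x in synth]
    pos, neg = [], []
    for i, (x, label) in enumerate(data):
        neighbours = sorted(
            (math.dist(x, y), lab) for j, (y, lab) in enumerate(data) if j != i
        )[:k]
        score = sum(lab for _, lab in neighbours) / k
        (pos if label == 1 else neg).append(score)
    return auc(pos, neg)

def knn_coverage(real, synth, k=5):
    """Fraction of real points with at least one synthetic point inside their k-NN ball.

    The ball radius is the distance to the k-th nearest *other* real point
    (an assumed definition; the paper's metric may differ).
    """
    covered = 0
    for i, x in enumerate(real):
        others = real[:i] + real[i + 1:]
        radius = sorted(math.dist(x, o) for o in others)[k - 1]
        if any(math.dist(x, s) <= radius for s in synth):
            covered += 1
    return covered / len(real)

rng = random.Random(0)  # fixed seed so the toy data is reproducible

def gaussian_cloud(n, std):
    """Toy 2-D feature vectors; real use would substitute per-table features."""
    return [(rng.gauss(0.0, std), rng.gauss(0.0, std)) for _ in range(n)]

real = gaussian_cloud(150, 1.0)      # broad "real" distribution
narrow = gaussian_cloud(150, 0.1)    # synthetic set confined to a narrow region
matched = gaussian_cloud(150, 1.0)   # synthetic set drawn like the real one

auc_narrow, cov_narrow = discriminator_auc(real, narrow), knn_coverage(real, narrow)
auc_matched, cov_matched = discriminator_auc(real, matched), knn_coverage(real, matched)
print("narrow :", auc_narrow, cov_narrow)
print("matched:", auc_matched, cov_matched)
```

In this toy setting the narrow synthetic set should be easy to discriminate and should cover few real points, while the matched set should give an AUC close to chance and high coverage. That is the pattern the abstract describes as a synthetic prior occupying a narrow region of the space of real tables.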


Related papers