Fugu-MT Paper Translation (Abstract): Towards Pretraining Text Encoders for TabPFN

Paper abstract: Towards Pretraining Text Encoders for TabPFN

arxiv url: http://arxiv.org/abs/2606.04876v1
Date: Wed, 03 Jun 2026 13:38:47 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.783059
Title: Towards Pretraining Text Encoders for TabPFN
Title (reference translation): Toward pretraining text encoders for TabPFN
Authors: Mustafa Tajjar, Alexander Pfefferle, Lennart Purucker, Frank Hutter
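Abstract summary: Tabular foundation models such as TabPFN achieve strong performance on datasets with numerical and categorical data. The paper introduces the TabPFN Text Adapter (text-to-TFM token projection). This design removes the PCA bottleneck, preserves TabPFN's numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.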
Reference score (independently computed attention score): 78.5840707720685
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tabular foundation models, such as TabPFN, achieve strong performance on tabular datasets with numerical and categorical data, but do not natively handle high-cardinality text features. Standard pipelines, therefore, embed text with a language model and compress the resulting vectors with PCA into a small number of scalar features before inputting them into TabPFN. This creates an information bottleneck: most embedding dimensions are discarded, and the compressed representation must then be expanded again by TabPFN's feature encoder. End-to-end alternatives can avoid PCA, but they require large amounts of pretraining data containing text cells and usually perform subpar compared to tabular foundation models that were pretrained on large amounts of synthetic data. Inspired by modality-alignment approaches like LLaVA (vision-to-LLM token projection) and TableGPT-style systems (table-to-LLM token projection), we introduce the TabPFN Text Adapter (text-to-TFM token projection). We freeze both the sentence encoder and TabPFN, and train only a lightweight adapter that maps text embeddings into a short sequence of tokens in TabPFN's embedding space. This design removes the PCA bottleneck, preserves TabPFN's numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.
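Abstract (reference translation): Tabular foundation models such as TabPFN achieve strong performance on tabular datasets with numerical and categorical data, but do not natively handle high-cardinality text features. Standard pipelines therefore embed text with a language model, compress the resulting vectors with PCA into a small number of scalar features, and then input them into TabPFN. Most embedding dimensions are discarded, and the compressed representation must be expanded again by TabPFN's feature encoder. End-to-end alternatives can avoid PCA, but they require large amounts of pretraining data containing text cells and usually underperform tabular foundation models pretrained on large amounts of synthetic data. Inspired by modality-alignment approaches such as LLaVA (vision-to-LLM token projection) and TableGPT (table-to-LLM token projection), we introduce the TabPFN Text Adapter (text-to-TFM token projection). We freeze both the sentence encoder and TabPFN and train only a lightweight adapter that maps text embeddings into a short sequence of tokens in TabPFN's embedding space. This design removes the PCA bottleneck, preserves TabPFN's numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.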
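
The contrast the abstract draws between the PCA baseline and the adapter can be illustrated with a small shape-level sketch. Everything here is an assumption made for illustration and is not taken from the paper: the sizes (`embed_dim`, `n_pca`, `n_tokens`, `token_dim`), the random stand-in for sentence embeddings, the helper names `pca_compress` and `adapter_project`, and the choice of a single linear layer as the adapter. No real sentence encoder or TabPFN model is called.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
n_rows, embed_dim = 200, 384   # table rows, sentence-embedding width
n_pca = 8                      # scalar features kept by the PCA baseline
n_tokens, token_dim = 4, 192   # adapter output: tokens in an assumed TFM embedding space

# Stand-in for frozen sentence-encoder outputs (random here, real embeddings in practice).
text_emb = rng.normal(size=(n_rows, embed_dim))


def pca_compress(x, k):
    """Baseline: project centered embeddings onto the top-k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def adapter_project(x, weight, bias, n_tokens, token_dim):
    """Adapter sketch: one linear map, reshaped into a short token sequence per text cell."""
    flat = x @ weight + bias
    return flat.reshape(x.shape[0], n_tokens, token_dim)


# PCA route: each text cell is reduced to n_pca scalars before reaching the tabular model.
pca_features = pca_compress(text_emb, n_pca)

# Adapter route: only these parameters would be trained; encoder and tabular model stay frozen.
weight = rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, n_tokens * token_dim))
bias = np.zeros(n_tokens * token_dim)
tokens = adapter_project(text_emb, weight, bias, n_tokens, token_dim)

print(pca_features.shape)  # (200, 8)
print(tokens.shape)        # (200, 4, 192)
```

With these assumed sizes, the PCA route passes 8 numbers per text cell to the tabular model, while the adapter route passes 4 tokens of width 192 directly in the model's embedding space. How the adapter is actually parameterized and trained is not specified in the abstract.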


Related papers