Fugu-MT paper translation (abstract): STRABLE: Benchmarking Tabular Machine Learning with Strings

Paper overview: STRABLE: Benchmarking Tabular Machine Learning with Strings

arxiv url: http://arxiv.org/abs/2605.12292v1
Date: Tue, 12 May 2026 15:47:50 GMT
Status: translation complete
System update date: 2026-05-13 21:48:56.981992
Title: STRABLE: Benchmarking Tabular Machine Learning with Strings
Authors: Gioia Blayer, Myung Jun Kim, Félix Lefebvre, Lennart Purucker, Alan Arazi, Eilam Shapira, Roi Reichart, Frank Hutter, Marine Le Morvan, David Holzmüller, Gaël Varoquaux
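Abstract summary: STRABLE is a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. The authors evaluate 445 pipelines in a large-scale empirical study of tabular learning with strings. Because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost.
Reference score (independently computed attention score): 53.03295517218137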
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.
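The modular pipelines described in the abstract (strings are first encoded, then post-processed, and finally passed to a tabular learner) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the character n-gram hashing encoder, the L2 normalization used as post-processing, the nearest-centroid learner, and the toy rows are stand-ins, not the encoders, post-processors, or learners that STRABLE evaluates.

```python
import hashlib
import math


def encode_string(text, dim=16, n=3):
    """Encoder stage (illustrative): hash character n-grams into a count vector."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(max(len(padded) - n + 1, 1)):
        gram = padded[i:i + n]
        # hashlib is stable across runs, unlike the built-in hash() on strings.
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def l2_normalize(vec):
    """Post-processing stage (illustrative): scale the embedding to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def featurize(row, dim=16):
    """Concatenate numeric cells with post-processed embeddings of string cells."""
    features = []
    for cell in row:
        if isinstance(cell, str):
            features.extend(l2_normalize(encode_string(cell, dim)))
        else:
            features.append(float(cell))
    return features


class NearestCentroid:
    """Learner stage: a deliberately simple stand-in for a tabular learner."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for j, v in enumerate(x):
                acc[j] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: sq_dist(x, self.centroids_[c]))
            for x in X
        ]


# Hypothetical toy table: one string column and one numeric column.
train_rows = [
    ("red apple", 1.0),
    ("green apple", 1.2),
    ("steel hammer", 9.0),
    ("claw hammer", 8.5),
]
train_labels = ["fruit", "fruit", "tool", "tool"]

model = NearestCentroid().fit([featurize(r) for r in train_rows], train_labels)
test_rows = [("yellow apple", 1.1), ("rubber hammer", 9.2)]
predictions = model.predict([featurize(r) for r in test_rows])
print(predictions)
```

Because each stage is a separate function or class, one stage can be swapped (for example, a different string encoder) while the others stay fixed, which is the kind of combination a modular pipeline study varies.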
