Benchmarking Multimodal AutoML for Tabular Data with Text Fields
- URL: http://arxiv.org/abs/2111.02705v1
- Date: Thu, 4 Nov 2021 09:29:16 GMT
- Title: Benchmarking Multimodal AutoML for Tabular Data with Text Fields
- Authors: Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. Smola
- Abstract summary: We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
- Score: 83.43249184357053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the use of automated supervised learning systems for data tables
that not only contain numeric/categorical columns, but one or more text fields
as well. Here we assemble 18 multimodal data tables that each contain some text
fields and stem from a real business application. Our publicly-available
benchmark enables researchers to comprehensively evaluate their own methods for
supervised learning with numeric, categorical, and text features. To ensure
that any single modeling strategy which performs well over all 18 datasets will
serve as a practical foundation for multimodal text/tabular AutoML, the diverse
datasets in our benchmark vary greatly in: sample size, problem types (a mix of
classification and regression tasks), number of features (with the number of
text columns ranging from 1 to 28 between datasets), as well as how the
predictive signal is decomposed between text vs. numeric/categorical features
(and predictive interactions thereof). Over this benchmark, we evaluate various
straightforward pipelines to model such data, including standard two-stage
approaches where NLP is used to featurize the text such that AutoML for tabular
data can then be applied. Compared with human data science teams, the fully
automated methodology that performed best on our benchmark (stack ensembling a
multimodal Transformer with various tree models) also manages to rank 1st place
when fit to the raw text/tabular data in two MachineHack prediction
competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price
Suggestion Challenge.
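The two-stage recipe mentioned in the abstract (featurize the text, then hand everything to a tabular learner) can be made concrete with a toy sketch. This is not the paper's pipeline: the column names (`description`, `price`, `qty`) are hypothetical, a hashed bag-of-words stands in for the pretrained NLP featurizer, and a nearest-centroid classifier stands in for the tabular AutoML stage.

```python
import zlib

def hash_bow(text, dim=8):
    """Stage 1: toy hashed bag-of-words; a stand-in for a pretrained
    text encoder that maps a text field to a fixed-length vector."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec

def featurize(row, dim=8):
    """Concatenate the text vector with the numeric columns so a
    tabular model sees one flat feature vector per row."""
    return hash_bow(row["description"], dim) + [row["price"], row["qty"]]

def fit_centroids(rows, labels):
    """Stage 2: trivial 'tabular learner' -- one centroid per class."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        x = featurize(row)
        if y not in sums:
            sums[y], counts[y] = [0.0] * len(x), 0
        sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is nearest in feature space."""
    x = featurize(row)
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

train = [
    {"description": "broken screen needs repair", "price": 10.0, "qty": 1},
    {"description": "brand new in box", "price": 90.0, "qty": 1},
]
centroids = fit_centroids(train, ["cheap", "pricey"])
```

In practice both stages would be swapped for real components (e.g. a Transformer text encoder and a gradient-boosted-tree ensemble), but the data flow is the same: text becomes numbers, then everything is one table.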
Related papers
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
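A minimal sketch of such an ensemble, assuming each constituent detector emits a probability that a text is machine-generated (the soft-voting scheme, detector names, and threshold here are illustrative, not taken from the paper):

```python
def ensemble_detect(probs, threshold=0.5):
    """Soft-voting ensemble: average the per-detector probabilities
    that a text is AI-generated, then threshold the mean score."""
    mean = sum(probs.values()) / len(probs)
    return ("ai" if mean >= threshold else "human"), mean

# Hypothetical scores from three constituent detectors for one input.
scores = {"detector_a": 0.91, "detector_b": 0.72, "detector_c": 0.40}
label, mean_score = ensemble_detect(scores)
```

Averaging probabilities rather than voting on hard labels lets a confident detector outweigh uncertain ones, which is the usual motivation for soft voting.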
arXiv Detail & Related papers (2023-11-06T13:11:02Z)
- Text2Topic: Multi-Label Text Classification System for Efficient Topic Detection in User Generated Content with Zero-Shot Capabilities [2.7311827519141363]
We propose Text to Topic (Text2Topic), which achieves high multi-label classification performance.
Text2Topic supports zero-shot predictions, produces domain-specific text embeddings, and enables production-scale batch-inference.
The model is deployed on a real-world stream processing platform, and it outperforms other models with 92.9% micro mAP.
arXiv Detail & Related papers (2023-10-23T11:33:24Z)
- A multi-model-based deep learning framework for short text multiclass classification with the imbalanced and extremely small data set [0.6875312133832077]
This paper proposes a multimodel-based deep learning framework for short-text multiclass classification with an imbalanced and extremely small data set.
It matches state-of-the-art baseline performance in terms of precision, recall, accuracy, and F1 score.
arXiv Detail & Related papers (2022-06-24T00:51:02Z)
- Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval [129.25914272977542]
RetoMaton is a weighted finite automaton built on top of the language model's datastore of saved training contexts.
Traversing this automaton at inference time, in parallel with the LM's own inference, reduces the model's perplexity.
arXiv Detail & Related papers (2022-01-28T21:38:56Z)
- Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models [2.5621280373733604]
Some questions cannot be answered by text alone but require information stored in tables.
We present an approach for retrieving both texts and tables relevant to a question by jointly encoding texts, tables and questions into a single vector space.
We release the newly created multi-modal dataset to the community so that it can be used for training and evaluation.
arXiv Detail & Related papers (2021-08-09T14:02:00Z)
- AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data [120.2298620652828]
We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models.
Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and substantially more accurate than competing AutoML frameworks.
arXiv Detail & Related papers (2020-03-13T23:10:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.