LLM Embeddings for Deep Learning on Tabular Data
- URL: http://arxiv.org/abs/2502.11596v1
- Date: Mon, 17 Feb 2025 09:28:51 GMT
- Title: LLM Embeddings for Deep Learning on Tabular Data
- Authors: Boshko Koloski, Andrei Margeloiu, Xiangjian Jiang, Blaž Škrlj, Nikola Simidjievski, Mateja Jamnik
- Abstract summary: Tabular deep-learning methods require embedding numerical and categorical input features into high-dimensional spaces before processing them.
Existing methods deal with this heterogeneous nature of data by employing separate type-specific encoding approaches.
We propose a novel approach that first transforms tabular data into text, and then leverages pre-trained representations from LLMs to encode this data, resulting in a plug-and-play solution.
- Score: 10.95164847873571
- License:
- Abstract: Tabular deep-learning methods require embedding numerical and categorical input features into high-dimensional spaces before processing them. Existing methods deal with this heterogeneous nature of tabular data by employing separate type-specific encoding approaches. This limits the cross-table transfer potential and the exploitation of pre-trained knowledge. We propose a novel approach that first transforms tabular data into text, and then leverages pre-trained representations from LLMs to encode this data, resulting in a plug-and-play solution to improving deep-learning tabular methods. We demonstrate that our approach improves accuracy over competitive models, such as MLP, ResNet, and FT-Transformer, by validating on seven classification datasets.
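The recipe is easy to sketch. Below is a minimal illustration under stated assumptions: a sentence-transformers encoder and an ad-hoc "column is value" serialization template, neither of which is claimed to match the paper's exact choices; the resulting embeddings can feed any downstream tabular model such as an MLP, ResNet, or FT-Transformer.

```python
# Minimal sketch of the general recipe (not the authors' exact pipeline):
# serialize each row as text, embed it with a pre-trained sentence encoder,
# and use the embeddings as plug-and-play features for a tabular model.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def serialize_row(row: dict) -> str:
    # e.g. "age is 39. workclass is Private. education is Bachelors."
    return " ".join(f"{col} is {val}." for col, val in row.items())

rows = [
    {"age": 39, "workclass": "Private", "education": "Bachelors"},
    {"age": 52, "workclass": "Self-emp", "education": "HS-grad"},
]
texts = [serialize_row(r) for r in rows]
X = encoder.encode(texts)   # shape: (n_rows, embedding_dim)
print(X.shape)              # feed X to an MLP / ResNet / FT-Transformer head
```

Because the encoder is shared across tables, rows from different schemas land in the same embedding space, which is what enables the cross-table transfer the abstract describes.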
Related papers
- Transfer Learning of Tabular Data by Finetuning Large Language Models [0.0]
This paper investigates the effectiveness of an application programming interface (API) and transfer learning of large language models (LLMs).
LLM APIs respond to input text prompts with tokenized data and instructions, whereas transfer learning finetunes an LLM for a target classification task.
This paper proposes end-to-end finetuning of an LLM to demonstrate cross-data transfer learning on ten benchmark datasets.
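As a hedged sketch of this setup (model choice, serialization template, and training arguments are illustrative assumptions, not the paper's configuration):

```python
# Finetune a pre-trained LM on serialized tabular rows for classification.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

texts = ["age is 39. workclass is Private.", "age is 52. workclass is Self-emp."]
labels = [0, 1]
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class RowDataset(torch.utils.data.Dataset):
    """Wrap pre-tokenized rows and labels for the Trainer."""
    def __init__(self, enc, labels):
        self.enc, self.labels = enc, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=RowDataset(enc, labels),
)
trainer.train()
```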
arXiv Detail & Related papers (2025-01-12T16:23:18Z)
- Distributionally robust self-supervised learning for tabular data [2.942619386779508]
Learning robust representations in the presence of error slices is challenging due to high-cardinality features and the complexity of constructing error sets.
Traditional robust representation learning methods largely focus on improving worst-group performance in supervised settings in computer vision.
Our approach utilizes an encoder-decoder model trained with Masked Language Modeling (MLM) loss to learn robust latent representations.
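A minimal sketch of MLM-style pretraining over serialized rows, assuming a standard masked-LM backbone and tokenizer; the paper's encoder-decoder architecture and robustness machinery are not reproduced here:

```python
# Mask random tokens in serialized rows and train the model to reconstruct
# them, yielding latent representations of the tabular data.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

texts = ["age is 39. workclass is Private.", "age is 52. education is HS-grad."]
batch = collator([tokenizer(t) for t in texts])  # random 15% token masking
loss = model(**batch).loss                       # reconstruct masked tokens
loss.backward()
```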
arXiv Detail & Related papers (2024-10-11T04:23:56Z)
- Tabular Transfer Learning via Prompting LLMs [52.96022335067357]
We propose a novel framework, Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with large language models (LLMs).
P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts.
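A hedged sketch of that correlation-then-demonstrate idea, with hypothetical column names and prompt template:

```python
# Pick the source column most correlated with a target-task feature and
# format its values as pseudo-demonstrations for an LLM prompt.
import pandas as pd

source = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [4, 3, 2, 1],
                       "target_like": [1.1, 2.0, 2.9, 4.2]})
corr = source.drop(columns=["target_like"]).corrwith(source["target_like"]).abs()
best = corr.idxmax()   # most target-correlated source column

demos = "\n".join(
    f"Input: {row[best]} -> Output: {row['target_like']}"
    for _, row in source.iterrows()
)
prompt = f"{demos}\nInput: 2.5 -> Output:"   # pseudo-demonstrations + query
print(prompt)
```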
arXiv Detail & Related papers (2024-08-09T11:30:52Z)
- Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning [0.0]
This study explores how Language Models (LMs) can be used for feature representation and prediction in machine learning tasks.
Our study assesses how emerging LM technologies compare with traditional paradigms in tabular machine learning.
Our findings reveal that current pre-trained models should not replace conventional approaches.
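A minimal sketch of the kind of head-to-head protocol such a study implies, with an illustrative dataset and classifier (not the study's own benchmark):

```python
# Fit the same classifier on raw tabular features versus LM-embedding
# features and compare cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
raw_score = cross_val_score(LogisticRegression(max_iter=1000), X, y).mean()
# X_lm = encoder.encode([serialize_row(r) for r in rows])  # as sketched earlier
# lm_score = cross_val_score(LogisticRegression(max_iter=1000), X_lm, y).mean()
print(f"raw-feature accuracy: {raw_score:.3f}")
```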
arXiv Detail & Related papers (2024-06-19T21:19:37Z)
- Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent.
We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements.
Our findings show that instruction-tuned models can expose pre-training data as much as their base models, if not more, and that using instructions proposed by other LLMs can open a new avenue for automated attacks.
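The exact overlap metric is not specified in this summary; the sketch below uses trigram overlap as an illustrative stand-in:

```python
# Measure how much of a model's output reproduces a training document,
# as the fraction of generated n-grams that also occur in the document.
def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(generated: str, training_doc: str, n: int = 3) -> float:
    g, t = ngrams(generated.split(), n), ngrams(training_doc.split(), n)
    return len(g & t) / max(len(g), 1)

print(overlap("the quick brown fox jumps", "a quick brown fox jumps high"))
```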
arXiv Detail & Related papers (2024-03-05T19:32:01Z)
- Backward Lens: Projecting Language Model Gradients into the Vocabulary Space [94.85922991881242]
We show that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs.
We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
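A hedged sketch of that decomposition for a single linear layer, with random shapes and a hypothetical unembedding matrix standing in for the LM's:

```python
# For y = W x, the per-example gradient dL/dW = g x^T is rank-1, where g is
# the backward-pass input (dL/dy) and x the forward-pass input; a batch
# gradient is therefore a low-rank combination of the two. Projecting a
# gradient column through an unembedding matrix E reads it as vocab logits.
import numpy as np

d_in, d_out, vocab = 8, 8, 100
x = np.random.randn(d_in)          # forward-pass input to the layer
g = np.random.randn(d_out)         # backward-pass input (dL/dy)
grad_W = np.outer(g, x)            # dL/dW, rank 1 per example
assert np.linalg.matrix_rank(grad_W) == 1

E = np.random.randn(vocab, d_out)  # stand-in unembedding ("logit lens") matrix
logits = E @ grad_W[:, 0]          # project one gradient column into vocab
print(np.argsort(-logits)[:5])     # token ids the update writes toward
```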
arXiv Detail & Related papers (2024-02-20T09:57:08Z)
- Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels.
Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
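A minimal sketch of that meta-representation, assuming Euclidean distance and an arbitrary k (the paper's exact choices may differ):

```python
# Represent each instance by its distances to k nearest training neighbors,
# concatenated with those neighbors' labels, giving every table a shared
# k-dimensional-per-block feature space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_train = np.random.randn(50, 4)
y_train = np.random.randint(0, 2, size=50)
k = 8

nn = NearestNeighbors(n_neighbors=k).fit(X_train)
dist, idx = nn.kneighbors(np.random.randn(5, 4))     # 5 query instances
meta = np.concatenate([dist, y_train[idx]], axis=1)  # shape: (5, 2k)
print(meta.shape)   # same dimensionality regardless of the source table
```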
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an unlearning framework that can efficiently update LLMs without retraining the whole model after data removal.
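As a hedged illustration of the goal, the sketch below uses generic gradient ascent on the forget set; this is a common unlearning baseline, not necessarily this paper's mechanism:

```python
# Push the model away from examples it must forget by ascending (rather
# than descending) their loss, without retraining from scratch.
import torch

model = torch.nn.Linear(16, 2)                 # toy stand-in for an LLM
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x_forget = torch.randn(4, 16)
y_forget = torch.randint(0, 2, (4,))

loss = torch.nn.functional.cross_entropy(model(x_forget), y_forget)
(-loss).backward()                             # ascend: unlearn these examples
opt.step()
```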
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
- Distinguishability Calibration to In-Context Learning [31.375797763897104]
We propose a method to map a PLM-encoded embedding into a new metric space to guarantee the distinguishability of the resulting embeddings.
We also take advantage of hyperbolic embeddings to capture the hierarchical relations among fine-grained class-associated token embeddings.
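A hedged sketch of one standard way to obtain hyperbolic embeddings, the exponential map at the origin of the Poincaré ball; the paper's learned calibration map is not reproduced here:

```python
# Map Euclidean PLM embeddings into the Poincare ball, where distances grow
# exponentially toward the boundary and hierarchies embed naturally.
import numpy as np

def expmap0(v: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.tanh(norm) * v / (norm + eps)   # lands strictly inside unit ball

emb = np.random.randn(4, 8)                    # stand-in PLM embeddings
hyp = expmap0(emb)
assert np.all(np.linalg.norm(hyp, axis=-1) < 1.0)
```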
arXiv Detail & Related papers (2023-02-13T09:15:00Z)
- Bi-level Alignment for Cross-Domain Crowd Counting [113.78303285148041]
Current methods rely on external data for training an auxiliary task or apply an expensive coarse-to-fine estimation.
We develop a new adversarial-learning-based method that is simple and efficient to apply.
We evaluate our approach on five real-world crowd counting benchmarks, where we outperform existing approaches by a large margin.
arXiv Detail & Related papers (2022-05-12T02:23:25Z)
- Numeric Encoding Options with Automunge [0.0]
This paper offers arguments for the potential benefits of extended encodings of numeric streams in deep learning.
Proposals are based on options for numeric transformations available in the Automunge open-source Python library.
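A hedged sketch of what "extended encodings of a numeric stream" can look like, using generic NumPy transformations rather than Automunge's actual API:

```python
# Derive several views of one numeric column: z-score, min-max, and a
# binned one-hot encoding, stacked into a wider feature block.
import numpy as np

x = np.array([1.0, 5.0, 9.0, 13.0])
z_score = (x - x.mean()) / x.std()             # standard normalization
min_max = (x - x.min()) / (x.max() - x.min())  # scale to [0, 1]

bins = np.digitize(x, bins=[4.0, 8.0, 12.0])   # ordinal bin ids in 0..3
one_hot = np.eye(4)[bins]                      # binned one-hot encoding
features = np.column_stack([z_score, min_max, one_hot])
print(features.shape)                          # one column in, six columns out
```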
arXiv Detail & Related papers (2022-02-19T02:21:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.