Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications
- URL: http://arxiv.org/abs/2405.01585v1
- Date: Sun, 28 Apr 2024 14:58:55 GMT
- Title: Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications
- Authors: Sujit Khanna, Shishir Subedi
- Abstract summary: Tabular Embedding Model (TEM) is a novel approach to fine-tuning embedding models for tabular Retrieval-Augmented Generation (RAG) applications.
TEM not only outperforms current SOTA embedding models in this domain but also does so with a notably smaller and more efficient model structure.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent times, Large Language Models have exhibited tremendous capabilities, especially in the areas of mathematics, code generation and general-purpose reasoning. However, for specialized domains, especially applications that require parsing and analyzing large chunks of numeric or tabular data, even state-of-the-art (SOTA) models struggle. In this paper, we introduce a new approach to solving domain-specific tabular data analysis tasks by presenting a unique RAG workflow that mitigates the scalability issues of existing tabular LLM solutions. Specifically, we present the Tabular Embedding Model (TEM), a novel approach to fine-tuning embedding models for tabular Retrieval-Augmented Generation (RAG) applications. Embedding models form a crucial component of the RAG workflow, yet even current SOTA embedding models struggle because they are predominantly trained on textual datasets and thus underperform in scenarios involving complex tabular data. The evaluation results show that our approach not only outperforms current SOTA embedding models in this domain but also does so with a notably smaller and more efficient model structure.
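The abstract describes, but does not show, the core workflow: fine-tune an embedding model on query-to-table-description pairs, then use it as the retriever in a tabular RAG pipeline. The sketch below is a minimal illustration of that idea built on the off-the-shelf sentence-transformers library; the base model, example tables, queries, and hyperparameters are placeholder assumptions, not the paper's actual setup.

```python
# Minimal sketch (not the authors' code): fine-tune a generic sentence-transformers
# model on (query, table description) pairs, then use it to retrieve the tables
# most relevant to an analytical question.
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# Hypothetical corpus: natural-language descriptions of the tables the retriever
# must match against user queries.
table_descriptions = [
    "daily_prices: date, ticker, open, high, low, close, volume for US equities",
    "fundamentals: quarterly revenue, net income, EPS and debt ratios per company",
    "fx_rates: daily spot exchange rates for major currency pairs vs USD",
]

# Hypothetical supervision: each query is paired with the description of the table
# that answers it; in-batch negatives supply the contrastive signal.
train_examples = [
    InputExample(texts=["What was Apple's closing price last week?", table_descriptions[0]]),
    InputExample(texts=["Compare quarterly revenue growth of two companies", table_descriptions[1]]),
    InputExample(texts=["How did EUR/USD move over the past month?", table_descriptions[2]]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small base model as a stand-in for TEM's backbone
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # contrastive loss over in-batch negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=2)

# Retrieval step of the RAG workflow: embed the query, rank table descriptions by
# cosine similarity, and pass only the top-k tables to the downstream analysis step.
query = "Plot the 30-day rolling volatility of a given stock"
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(table_descriptions, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
for idx in scores.argsort(descending=True)[:2]:
    print(f"{scores[idx].item():.3f}  {table_descriptions[idx]}")
```

In the full workflow the abstract alludes to, the retrieved table descriptions would presumably be handed to a downstream analysis step (for example, an LLM generating code over only the selected tables) rather than placing entire tables in the prompt, which is where the claimed scalability benefit over existing tabular LLM solutions comes from.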
Related papers
- Do We Need Domain-Specific Embedding Models? An Empirical Investigation [18.990655668481075]
We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to the Massive Text Embedding Benchmark (MTEB).
We evaluate the performance of seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop compared to their performance on MTEB.
Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns.
arXiv Detail & Related papers (2024-09-27T07:46:06Z) - Cross-Domain Content Generation with Domain-Specific Small Language Models [3.2772349789781616]
This study explores methods to enable a small language model to produce coherent and relevant outputs for two different domains.
We find that utilizing custom tokenizers tailored to each dataset significantly enhances generation quality.
Our findings demonstrate that knowledge expansion with frozen layers is an effective method for small language models to generate domain-specific content.
arXiv Detail & Related papers (2024-09-19T21:45:13Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable generates out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Adapting Large Language Models for Content Moderation: Pitfalls in Data
Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z) - Observatory: Characterizing Embeddings of Relational Tables [15.808819332614712]
Researchers and practitioners are keen to leverage language and table embedding models in many new application contexts.
There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage.
We propose Observatory, a formal framework to systematically analyze embedding representations of relational tables.
arXiv Detail & Related papers (2023-10-05T00:58:45Z) - Deep incremental learning models for financial temporal tabular datasets with distribution shifts [0.9790236766474201]
The framework uses a simple building block (decision trees) to build self-similar models of any required complexity.
We demonstrate our scheme using XGBoost models trained on the Numerai dataset and show that a two layer deep ensemble of XGBoost models over different model snapshots delivers high quality predictions.
arXiv Detail & Related papers (2023-03-14T14:10:37Z) - Graph-Regularized Tensor Regression: A Domain-Aware Framework for Interpretable Multi-Way Financial Modelling [23.030263841031633]
We develop a novel Graph-Regularized Tensor Regression (GRTR) framework, whereby knowledge about cross-asset relations is incorporated into the model in the form of a graph Laplacian matrix.
By virtue of tensor algebra, the proposed framework is shown to be fully interpretable, both coefficient-wise and dimension-wise.
The GRTR model is validated in a multi-way financial forecasting setting and is shown to achieve improved performance at reduced computational costs.
arXiv Detail & Related papers (2022-10-26T13:39:08Z) - Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning [65.268245109828]
In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models.
Deep learning in resource-limited domains still faces multiple challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning.
Model reprogramming enables resource-efficient cross-domain machine learning by repurposing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning.
arXiv Detail & Related papers (2022-02-22T02:33:54Z) - Explainable Matrix -- Visualization for Global and Local Interpretability of Random Forest Classification Ensembles [78.6363825307044]
We propose Explainable Matrix (ExMatrix), a novel visualization method for Random Forest (RF) interpretability.
It employs a simple yet powerful matrix-like visual metaphor, where rows are rules, columns are features, and cells are rule predicates.
ExMatrix applicability is confirmed via different examples, showing how it can be used in practice to promote the interpretability of RF models.
arXiv Detail & Related papers (2020-05-08T21:03:48Z) - Model Reuse with Reduced Kernel Mean Embedding Specification [70.044322798187]
We present a two-phase framework for finding helpful models for a current application.
In the upload phase, when a model is uploaded into the pool, we construct a reduced kernel mean embedding (RKME) as a specification for the model.
Then in the deployment phase, the relatedness of the current task and pre-trained models will be measured based on the value of the RKME specification.
arXiv Detail & Related papers (2020-01-20T15:15:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.