SMUTF: Schema Matching Using Generative Tags and Hybrid Features
- URL: http://arxiv.org/abs/2402.01685v2
- Date: Tue, 6 Feb 2024 06:03:13 GMT
- Title: SMUTF: Schema Matching Using Generative Tags and Hybrid Features
- Authors: Yu Zhang, Mei Di, Haozheng Luo, Chenwei Xu, Richard Tzong-Han Tsai
- Abstract summary: SMUTF assumes that supervised learning does not affect performance in open-domain tasks.
In an innovative adaptation inspired by the Humanitarian Exchange Language, we deploy 'generative tags' for each data column.
SMUTF exhibits extensive versatility, working seamlessly with any pre-existing pre-trained embeddings, classification methods, and generative models.
- Score: 6.471515752693932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SMUTF, a unique approach for large-scale tabular data schema
matching (SM), which assumes that supervised learning does not affect
performance in open-domain tasks, thereby enabling effective cross-domain
matching. This system uniquely combines rule-based feature engineering,
pre-trained language models, and generative large language models. In an
innovative adaptation inspired by the Humanitarian Exchange Language, we deploy
'generative tags' for each data column, enhancing the effectiveness of SM.
SMUTF exhibits extensive versatility, working seamlessly with any pre-existing
pre-trained embeddings, classification methods, and generative models.
Recognizing the lack of extensive, publicly available datasets for SM, we
have created and open-sourced the HDXSM dataset from the public humanitarian
data. We believe this to be the most exhaustive SM dataset currently available.
In evaluations across various public datasets and the novel HDXSM dataset,
SMUTF demonstrated exceptional performance, surpassing existing
state-of-the-art models in terms of accuracy and efficiency, and} improving the
F1 score by 11.84% and the AUC of ROC by 5.08%.
Related papers
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z) - $\textbf{Only-IF}$:Revealing the Decisive Effect of Instruction Diversity on Generalization [1.6958018695660049]
We show that $textbfonly emerges$ when training data is diversified enough across semantic domains.
We extend our analysis to real-world scenarios, including fine-tuning of $textit$textbfspecialist$$ and $textit$textbfgeneralist$$ models.
arXiv Detail & Related papers (2024-10-07T03:15:11Z) - Language Models are Graph Learners [70.14063765424012]
Language Models (LMs) are challenging the dominance of domain-specific models, including Graph Neural Networks (GNNs) and Graph Transformers (GTs)
We propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art GNNs on node classification tasks.
arXiv Detail & Related papers (2024-10-03T08:27:54Z) - A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z) - VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition [3.4923338594757674]
Large language models (LLMs) can be used to train a model capable of extracting various types of entities.
In this paper, we utilize the open-sourced LLM LLaMA2 as the backbone model, and design specific instructions to distinguish between different types of entities and datasets.
Our model VANER, trained with a small partition of parameters, significantly outperforms previous LLMs-based models and, for the first time, as a model based on LLM, surpasses the majority of conventional state-of-the-art BioNER systems.
arXiv Detail & Related papers (2024-04-27T09:00:39Z) - UniPredict: Large Language Models are Universal Tabular Classifiers [33.811778526930745]
This paper exploits the idea of building universal tabular data predictors based on generative modeling, namely UniPredict.
We train a single LLM on an aggregation of 169 datasets with diverse targets and compare its performance against baselines that are trained on each dataset separately.
We observe this versatile UniPredict model demonstrates an advantage over other models, ranging from 5.4% to 13.4%, when compared with the best tree-boosting baseline and the best neural network baseline.
arXiv Detail & Related papers (2023-10-05T02:37:09Z) - GenHPF: General Healthcare Predictive Framework with Multi-task
Multi-source Learning [9.406539794019581]
General Healthcare Predictive Framework (GenHPF) is applicable to any EHR with minimal preprocessing for multiple prediction tasks.
Our framework significantly outperforms baseline models that utilize domain knowledge in multi-source learning.
arXiv Detail & Related papers (2022-07-20T12:46:26Z) - Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
arXiv Detail & Related papers (2021-06-01T16:00:08Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - Multi-Domain Adversarial Feature Generalization for Person
Re-Identification [52.835955258959785]
We propose a multi-dataset feature generalization network (MMFA-AAE)
It is capable of learning a universal domain-invariant feature representation from multiple labeled datasets and generalizing it to unseen' camera systems.
It also surpasses many state-of-the-art supervised methods and unsupervised domain adaptation methods by a large margin.
arXiv Detail & Related papers (2020-11-25T08:03:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.