Related papers: Schemora: schema matching via multi-stage recommendation and metadata enrichment using off-the-shelf llms

URL: http://arxiv.org/abs/2507.14376v1
Date: Fri, 18 Jul 2025 21:50:36 GMT
Title: Schemora: schema matching via multi-stage recommendation and metadata enrichment using off-the-shelf llms
Authors: Osman Erman Gungor, Derak Paulsen, William Kang
Abstract summary: SCHEMORA is a schema matching framework that combines large language models with hybrid retrieval techniques. It is evaluated on the MIMIC-OMOP benchmark, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over previous best results.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Schema matching is essential for integrating heterogeneous data sources and enhancing dataset discovery, yet it remains a complex and resource-intensive problem. We introduce SCHEMORA, a schema matching framework that combines large language models with hybrid retrieval techniques in a prompt-based approach, enabling efficient identification of candidate matches without relying on labeled training data or exhaustive pairwise comparisons. By enriching schema metadata and leveraging both vector-based and lexical retrieval, SCHEMORA improves matching accuracy and scalability. Evaluated on the MIMIC-OMOP benchmark, it establishes new state-of-the-art performance, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over previous best results. To our knowledge, this is the first LLM-based schema matching method with an open-source implementation, accompanied by analysis that underscores the critical role of retrieval and provides practical guidance on model selection.
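The HitRate@k figures quoted above can be made concrete with a minimal sketch. It assumes the common definition of the metric: the fraction of source attributes whose correct target appears among the top k ranked candidates. The abstract does not spell out the exact evaluation protocol, so the function name `hit_rate_at_k`, the one-gold-target-per-attribute setup, and the toy attribute names below are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List


def hit_rate_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of source attributes whose gold target is in the top-k candidates.

    Assumption: each source attribute has exactly one correct target.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not gold:
        return 0.0
    hits = sum(
        1
        for source, target in gold.items()
        # A source attribute with no candidate list counts as a miss.
        if target in ranked.get(source, [])[:k]
    )
    return hits / len(gold)


# Hypothetical toy data: ranked candidate targets per source attribute.
ranked = {
    "src.admit_time": ["tgt.visit_start", "tgt.visit_end", "tgt.birth_date"],
    "src.gender": ["tgt.race", "tgt.ethnicity", "tgt.gender"],
    "src.drug_name": ["tgt.visit_start", "tgt.race", "tgt.visit_end"],
}
gold = {
    "src.admit_time": "tgt.visit_start",  # correct target ranked 1st
    "src.gender": "tgt.gender",           # correct target ranked 3rd
    "src.drug_name": "tgt.drug_concept",  # correct target never retrieved
}

for k in (1, 3, 5):
    print(f"HitRate@{k} = {hit_rate_at_k(ranked, gold, k):.3f}")
```

Under this definition HitRate@k can never decrease as k grows, which is why results are typically reported at several cutoffs such as 3 and 5.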

Related papers

Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval [49.669503570350166]
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task. Existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively. We propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking.
arXiv Detail & Related papers (2025-04-07T15:27:37Z)
Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching [3.7548609506798485]
We propose KG-RAG4SM, a Knowledge Graph-based Retrieval-Augmented Generation model for schema matching with large language models (LLMs). In particular, KG-RAG4SM introduces novel vector-based, graph-based, and query-based graph retrievals. We show that KG-RAG4SM outperforms the state-of-the-art (SOTA) methods by 35.89% and 30.50% in terms of precision and F1 score on the MIMIC dataset.
arXiv Detail & Related papers (2025-01-15T09:32:37Z)
Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
ReMatch: Retrieval Enhanced Schema Matching with LLMs [0.874967598360817]
We present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs). Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher.
arXiv Detail & Related papers (2024-03-03T17:14:40Z)
Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [51.33182775762785]
This paper presents an empirical study to build relation extraction systems in low-resource settings. We investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; and (iii) data augmentation technologies and self-training to generate more labeled in-domain data.
arXiv Detail & Related papers (2022-10-19T15:46:37Z)
Meeting Summarization with Pre-training and Clustering Methods [6.47783315109491]
We use HMNet, a hierarchical network that employs both a word-level transformer and a turn-level transformer, as the baseline. We extend the locate-then-summarize approach of QMSum with an intermediate clustering step. We compare the performance of our baseline models with BART, a state-of-the-art language model that is effective for summarization.
arXiv Detail & Related papers (2021-11-16T03:14:40Z)
Hyperparameter Optimization with Differentiable Metafeatures [5.586191108738563]
We propose a cross-dataset surrogate model called Differentiable Metafeature-based Surrogate (DMFBS). In contrast to existing models, DMFBS (i) integrates a differentiable metafeature extractor and (ii) is optimized using a novel multi-task loss. We compare DMFBS against several recent models for HPO on three large meta-datasets and show that it consistently outperforms all of them with an average 10% improvement.
arXiv Detail & Related papers (2021-02-07T11:06:31Z)
Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function [106.69643619725652]
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results. We report state-of-the-art results for the text classification task on several benchmark datasets.
arXiv Detail & Related papers (2020-09-08T21:55:22Z)
S^3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization [104.87483578308526]
We propose the model S3-Rec, which stands for Self-Supervised learning for Sequential Recommendation. For our task, we devise four auxiliary self-supervised objectives to learn the correlations among attribute, item, subsequence, and sequence. Extensive experiments conducted on six real-world datasets demonstrate the superiority of our proposed method over existing state-of-the-art methods.
arXiv Detail & Related papers (2020-08-18T11:44:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.