GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security
- URL: http://arxiv.org/abs/2406.01876v1
- Date: Tue, 4 Jun 2024 01:08:00 GMT
- Title: GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security
- Authors: Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Henrik Johnson, Nimish Amlathe, Davor Golac,
- Abstract summary: This study revisits the foundational problem within the context of large language models.
Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios.
The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.
- Score: 5.22260190195909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Schema matching constitutes a pivotal phase in the data ingestion process for contemporary database systems. Its objective is to discern pairwise similarities between two sets of attributes, each associated with a distinct data table. This challenge emerges at the initial stages of data analytics, such as when incorporating a third-party table into existing databases to inform business insights. Given its significance in the realm of database systems, schema matching has been under investigation since the 2000s. This study revisits this foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios: the model should analyze only a minimal amount of customer data to execute the matching task, contrasting with the conventional approach of scrutinizing the entire data table. We emphasize that the zero-shot or few-shot assumption is imperative to safeguard the identity and privacy of customer data, even at the potential cost of accuracy. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.
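To make the task definition concrete, the following is a minimal sketch of pairwise attribute matching that looks only at column names, so no customer rows are read (a zero-shot setting in the abstract's sense). It is not the GRAM method: plain string similarity from Python's standard `difflib` stands in for the language model, and the function names `name_similarity` and `match_schemas` and the example column names are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a lexical similarity in [0, 1] between two attribute names."""
    # Normalize case and underscores so "cust_name" and "Cust Name" compare equal.
    norm_a = a.lower().replace("_", " ")
    norm_b = b.lower().replace("_", " ")
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def match_schemas(source: list[str], target: list[str]) -> dict[str, tuple[str, float]]:
    """For each source attribute, pick the most similar target attribute."""
    matches = {}
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        matches[s] = (best, name_similarity(s, best))
    return matches


# Hypothetical schemas: only attribute names are used, never table contents.
source_attrs = ["cust_name", "email_addr", "zip"]
target_attrs = ["customer_name", "email_address", "postal_code"]
matches = match_schemas(source_attrs, target_attrs)

for src, (tgt, score) in matches.items():
    print(f"{src} -> {tgt} (similarity {score:.2f})")
```

Name-only matching of this kind handles abbreviations such as `cust_name`, but gives only a weak score to semantically equivalent pairs such as `zip` and `postal_code`, which is the kind of gap a language-model-based matcher is meant to close.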
Related papers
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
- An Ensemble Scheme for Proactive Dominant Data Migration of Pervasive Tasks at the Edge [5.4327243200369555]
We propose a scheme to be implemented by autonomous edge nodes concerning their identifications of the appropriate data to be migrated to particular locations within the infrastructure.
Our objective is to equip nodes with the capability to comprehend the access patterns relating to offloaded data-driven tasks.
It is evident that these tasks depend on the processing of data that is absent from the original hosting nodes.
To infer these data intervals, we utilize an ensemble approach that integrates a statistically oriented model and a machine learning framework.
arXiv Detail & Related papers (2024-10-12T19:09:16Z)
- Distributed In-Context Learning under Non-IID Among Clients [38.868357555845435]
In-context learning (ICL) provides a promising solution for few-shot adaptation by retrieving a set of data points relevant to a query.
In this paper, we show that test queries will have different preferences among clients because of non-IIDness.
We introduce a novel approach to tackle the distributed non-IID ICL problem when a data usage budget is present.
arXiv Detail & Related papers (2024-07-31T20:06:25Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [79.09622602860703]
We introduce InsightBench, a benchmark dataset with three key features.
It consists of 100 datasets representing diverse business use cases such as finance and incident management.
Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
- Wiki-TabNER: Advancing Table Interpretation Through Named Entity Recognition [19.423556742293762]
We analyse a widely used benchmark dataset for evaluation of table interpretation (TI) tasks.
To overcome this drawback, we construct and annotate a new more challenging dataset.
We propose a prompting framework for evaluating the newly developed large language models.
arXiv Detail & Related papers (2024-03-07T15:22:07Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Subjective Learning for Open-Ended Data [12.363642151877688]
We present a novel supervised learning paradigm of learning from open-ended data.
Open-ended data inherently requires multiple single-valued deterministic mapping functions.
We show that Open-ended Supervised Learning achieves human-like task cognition without task-level supervision.
arXiv Detail & Related papers (2021-08-27T04:18:45Z)
- Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing [52.24507547010127]
Cross-domain context-dependent semantic parsing is a new focus of research.
We present a dynamic graph framework that effectively models contextual utterances, tokens, database schemas, and their complicated interactions as the conversation proceeds.
The proposed framework outperforms all existing models by large margins, achieving new state-of-the-art performance on two large-scale benchmarks.
arXiv Detail & Related papers (2021-01-05T18:11:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.