GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security
- URL: http://arxiv.org/abs/2406.01876v1
- Date: Tue, 4 Jun 2024 01:08:00 GMT
- Authors: Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Henrik Johnson, Nimish Amlathe, Davor Golac,
- Abstract summary: This study revisits the foundational problem within the context of large language models.
Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios.
The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.
- Score: 5.22260190195909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Schema matching constitutes a pivotal phase in the data ingestion process for contemporary database systems. Its objective is to discern pairwise similarities between two sets of attributes, each associated with a distinct data table. This challenge emerges at the initial stages of data analytics, such as when incorporating a third-party table into existing databases to inform business insights. Given its significance in the realm of database systems, schema matching has been under investigation since the 2000s. This study revisits this foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios: the model should analyze only a minimal amount of customer data to execute the matching task, contrasting with the conventional approach of scrutinizing the entire data table. We emphasize that the zero-shot or few-shot assumption is imperative to safeguard the identity and privacy of customer data, even at the potential cost of accuracy. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.
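The abstract frames schema matching as scoring pairwise similarities between two sets of attributes, ideally without reading the customer's table contents. The sketch below illustrates only that problem shape. It is not the GRAM method: it replaces the language model with plain string similarity from Python's standard `difflib`. The attribute names and the helper functions `normalize`, `similarity_matrix`, and `best_matches` are hypothetical, chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and turn '_' and '-' into spaces so naming styles compare fairly."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def similarity_matrix(source: list[str], target: list[str]) -> list[list[float]]:
    """Pairwise string similarity in [0, 1] for every (source, target) name pair."""
    return [
        [SequenceMatcher(None, normalize(s), normalize(t)).ratio() for t in target]
        for s in source
    ]


def best_matches(source: list[str], target: list[str]) -> dict[str, tuple[str, float]]:
    """For each source attribute, pick the target attribute with the highest score."""
    scores = similarity_matrix(source, target)
    result = {}
    for s, row in zip(source, scores):
        j = max(range(len(target)), key=row.__getitem__)
        result[s] = (target[j], row[j])
    return result


# Hypothetical attribute names. Only the names are used and no table rows are
# read, which mirrors the zero-shot assumption described in the abstract.
source_attrs = ["cust_id", "email_addr", "signup_date"]
target_attrs = ["customer_id", "email_address", "registration_date", "phone"]

matches = best_matches(source_attrs, target_attrs)
for src, (tgt, score) in matches.items():
    print(f"{src} -> {tgt} ({score:.2f})")
```

A name-only matcher like this fails when two columns mean the same thing but share few characters, which is the kind of case where a language model or a handful of sample values (the few-shot setting) would be expected to help.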
Related papers
- InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [81.4242018694792]
We introduce InsightBench, a benchmark dataset with three key features.
It consists of 31 datasets representing diverse business use cases such as finance and incident management.
Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
- Wiki-TabNER: Advancing Table Interpretation Through Named Entity Recognition [19.423556742293762]
We analyse a widely used benchmark dataset for evaluating table interpretation (TI) tasks.
To overcome this drawback, we construct and annotate a new more challenging dataset.
We propose a prompting framework for evaluating the newly developed large language models.
arXiv Detail & Related papers (2024-03-07T15:22:07Z)
- Meta-Learning With Hierarchical Models Based on Similarity of Causal Mechanisms [23.842687721181107]
This work is motivated by personalised medicine, where a patient is a task and complex diseases are heterogeneous across patients in cause and progression.
We introduce to meta-learning, formulated as Bayesian hierarchical modelling, a proxy measure of similarity of the causal mechanisms of tasks.
We show that such pooling improves predictions in three health-related case studies.
arXiv Detail & Related papers (2023-10-19T09:03:41Z)
- Mining Java Memory Errors using Subjective Interesting Subgroups with Hierarchical Targets [1.188383832081829]
Subgroup Discovery (SD) is a data mining method that can automatically mine incident code and extract discriminant patterns to identify the root causes of issues.
We propose a novel SD approach that can handle complex target concepts with hierarchies.
We apply this framework to investigate out-of-memory errors and demonstrate its usefulness in incident diagnosis.
arXiv Detail & Related papers (2023-10-01T20:24:59Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Subjective Learning for Open-Ended Data [12.363642151877688]
We present a novel supervised learning paradigm of learning from open-ended data.
Open-ended data inherently requires multiple single-valued deterministic mapping functions.
We show that Open-ended Supervised Learning achieves human-like task cognition without task-level supervision.
arXiv Detail & Related papers (2021-08-27T04:18:45Z)
- Mining Feature Relationships in Data [0.0]
Feature relationship mining (FRM) uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data.
Our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features.
Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships.
arXiv Detail & Related papers (2021-02-02T07:06:16Z)
- Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing [52.24507547010127]
Cross-domain context-dependent semantic parsing is a new focus of research.
We present a dynamic graph framework that effectively models contextual utterances, tokens, database schemas, and their complicated interactions as the conversation proceeds.
The proposed framework outperforms all existing models by large margins, achieving new state-of-the-art performance on two large-scale benchmarks.
arXiv Detail & Related papers (2021-01-05T18:11:29Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)