Related papers: GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security

URL: http://arxiv.org/abs/2406.01876v1
Date: Tue, 4 Jun 2024 01:08:00 GMT
Title: GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security
Authors: Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Henrik Johnson, Nimish Amlathe, Davor Golac,
Abstract summary: This study revisits the foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.
Score: 5.22260190195909
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Schema matching constitutes a pivotal phase in the data ingestion process for contemporary database systems. Its objective is to discern pairwise similarities between two sets of attributes, each associated with a distinct data table. This challenge emerges at the initial stages of data analytics, such as when incorporating a third-party table into existing databases to inform business insights. Given its significance in the realm of database systems, schema matching has been under investigation since the 2000s. This study revisits this foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios: the model should analyze only a minimal amount of customer data to execute the matching task, contrasting with the conventional approach of scrutinizing the entire data table. We emphasize that the zero-shot or few-shot assumption is imperative to safeguard the identity and privacy of customer data, even at the potential cost of accuracy. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.

Related papers

Same Content, Different Representations: A Controlled Study for Table QA [15.896655757672441]
Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. Existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure.
arXiv Detail & Related papers (2025-09-26T22:33:19Z)
Adapting Vision-Language Models Without Labels: A Comprehensive Survey [74.17944178027015]
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. Recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms.
arXiv Detail & Related papers (2025-08-07T16:27:37Z)
Database-Agnostic Gait Enrollment using SetTransformers [3.3311266423308252]
We introduce a transformer-based framework for open-set gait enrollment. Our method is both dataset-agnostic and recognition-architecture-agnostic. We show that our method is flexible, is able to accurately perform enrollment in different scenarios, and scales better with data compared to traditional approaches.
arXiv Detail & Related papers (2025-05-05T17:42:27Z)
Financial Data Analysis with Robust Federated Logistic Regression [7.68275287892947]
In this study, we focus on the analysis of financial data in a federated setting, wherein data is distributed across multiple clients or locations. We propose a robust federated logistic regression-based framework that strives to strike a balance between these goals.
arXiv Detail & Related papers (2025-04-28T20:42:24Z)
Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models [14.236566119377352]
This paper presents TiInsight, an automated cross-domain exploratory data analysis system. TiInsight achieves hierarchical execution accuracy of 86.3% on the Spider dataset using GPT-4. It also demonstrates state-of-the-art performance on the Bird dataset.
arXiv Detail & Related papers (2024-12-10T06:11:23Z)
Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
An Ensemble Scheme for Proactive Dominant Data Migration of Pervasive Tasks at the Edge [5.4327243200369555]
We propose a scheme to be implemented by autonomous edge nodes concerning their identifications of the appropriate data to be migrated to particular locations within the infrastructure. Our objective is to equip nodes with the capability to comprehend the access patterns relating to offloaded data-driven tasks. It is evident that these tasks depend on the processing of data that is absent from the original hosting nodes. To infer these data intervals, we utilize an ensemble approach that integrates a statistically oriented model and a machine learning framework.
arXiv Detail & Related papers (2024-10-12T19:09:16Z)
Distributed In-Context Learning under Non-IID Among Clients [38.868357555845435]
In-context learning (ICL) provides a promising solution for few-shot adaptation by retrieving a set of data points relevant to a query. In this paper, we show that test queries will have different preferences among clients because of non-IIDness. We introduce a novel approach to tackle the distributed non-IID ICL problem when a data usage budget is present.
arXiv Detail & Related papers (2024-07-31T20:06:25Z)
Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining. We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure. This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [79.09622602860703]
We introduce InsightBench, a benchmark dataset with three key features. It consists of 100 datasets representing diverse business use cases such as finance and incident management. Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
Wiki-TabNER: Advancing Table Interpretation Through Named Entity Recognition [19.423556742293762]
We analyse a widely used benchmark dataset for evaluation of TI tasks. To overcome this drawback, we construct and annotate a new more challenging dataset. We propose a prompting framework for evaluating the newly developed large language models.
arXiv Detail & Related papers (2024-03-07T15:22:07Z)
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images. We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities. The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data. We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
Subjective Learning for Open-Ended Data [12.363642151877688]
We present a novel supervised learning paradigm of learning from open-ended data. Open-ended data inherently requires multiple single-valued deterministic mapping functions. We show that Open-ended Supervised Learning achieves human-like task cognition without task-level supervision.
arXiv Detail & Related papers (2021-08-27T04:18:45Z)
Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing [52.24507547010127]
Cross-domain context-dependent semantic parsing is a new focus of research. We present a dynamic graph framework that effectively models contextual utterances, tokens, database schemas, and their complicated interaction as the conversation proceeds. The proposed framework outperforms all existing models by large margins, achieving new state-of-the-art performance on two large-scale benchmarks.
arXiv Detail & Related papers (2021-01-05T18:11:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.