Matching Table Metadata with Business Glossaries Using Large Language
Models
- URL: http://arxiv.org/abs/2309.11506v1
- Date: Fri, 8 Sep 2023 02:23:59 GMT
- Title: Matching Table Metadata with Business Glossaries Using Large Language
Models
- Authors: Elita Lobo, Oktie Hassanzadeh, Nhan Pham, Nandana Mihindukulasooriya,
Dharmashankar Subramanian, Horst Samulowitz
- Abstract summary: We study the problem of matching table metadata to a business glossary containing data labels and descriptions.
The resulting matching enables the use of an available or curated business glossary for retrieval and analysis without or before requesting access to the data contents.
We leverage the power of large language models (LLMs) to design generic matching methods that do not require manual tuning.
- Score: 18.1687301652456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enterprises often own large collections of structured data in the form of
large databases or an enterprise data lake. Such data collections come with
limited metadata and strict access policies that could limit access to the data
contents and, therefore, limit the application of classic retrieval and
analysis solutions. As a result, there is a need for solutions that can
effectively utilize the available metadata. In this paper, we study the problem
of matching table metadata to a business glossary containing data labels and
descriptions. The resulting matching enables the use of an available or curated
business glossary for retrieval and analysis without or before requesting
access to the data contents. One solution to this problem is to use
manually-defined rules or similarity measures on column names and glossary
descriptions (or their vector embeddings) to find the closest match. However,
such approaches must be tuned through manual labeling and cannot handle business
glossaries that mix simple entries with complex, lengthy
descriptions. In this work, we leverage the power of large language models
(LLMs) to design generic matching methods that do not require manual tuning and
can identify complex relations between column names and glossaries. We propose
methods that utilize LLMs in two ways: a) by generating additional context for
column names that can aid with matching, and b) by using LLMs to directly infer whether
there is a relation between column names and glossary descriptions. Our
preliminary experimental results show the effectiveness of our proposed
methods.
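The two strategies in the abstract can be sketched in code. Note this is a minimal illustration, not the authors' implementation: `expand_with_llm` and `ask_llm` stand in for real LLM calls and are stubbed with simple keyword logic so the example runs offline, and the glossary, abbreviation table, and column names are all hypothetical.

```python
# Sketch of the two LLM-based matching strategies: (a) generate context
# for a column name and match by text similarity; (b) ask the model
# directly whether a column and a glossary description are related.
from difflib import SequenceMatcher

# Hypothetical business glossary: label -> description.
GLOSSARY = {
    "Customer Income": "Annual income of the customer in US dollars.",
    "Account Open Date": "Date on which the account was opened.",
}

# Stub knowledge a real LLM would supply implicitly.
ABBREVIATIONS = {"cust": "customer", "acct": "account", "dt": "date"}

def expand_with_llm(column: str) -> str:
    """Method (a) helper: an LLM would generate descriptive context for
    a terse column name; the stub just expands common abbreviations."""
    words = column.lower().split("_")
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def match_by_context(column: str) -> str:
    """Method (a): enrich the column name, then return the glossary
    label whose description is most similar to the enriched text."""
    context = expand_with_llm(column)
    return max(
        GLOSSARY,
        key=lambda label: SequenceMatcher(
            None, context, GLOSSARY[label].lower()
        ).ratio(),
    )

def ask_llm(column: str, description: str) -> bool:
    """Method (b): a real system would prompt the model, e.g.
    'Does column <column> refer to: <description>? Answer yes or no.'
    The stub answers yes when an expanded word appears in the text."""
    expanded = expand_with_llm(column).split()
    return any(word in description.lower() for word in expanded)

def match_by_inference(column: str) -> list[str]:
    """Collect every glossary label the (stubbed) LLM says matches."""
    return [label for label, desc in GLOSSARY.items() if ask_llm(column, desc)]
```

With a real model client in place of the stubs, method (a) reduces matching to any off-the-shelf similarity measure over enriched text, while method (b) trades extra LLM calls (one per candidate pair) for direct relational judgments.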
Related papers
- Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework that augments LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
arXiv Detail & Related papers (2024-10-17T17:03:23Z) - RoundTable: Leveraging Dynamic Schema and Contextual Autocomplete for Enhanced Query Precision in Tabular Question Answering [11.214912072391108]
Real-world datasets often feature a vast array of attributes and complex values.
Traditional methods cannot fully convey a dataset's size and complexity to Large Language Models.
We propose a novel framework that leverages Full-Text Search (FTS) on the input table.
arXiv Detail & Related papers (2024-08-22T13:13:06Z) - Schema Matching with Large Language Models: an Experimental Study [0.580553237364985]
We investigate the use of off-the-shelf Large Language Models (LLMs) for schema matching.
Our objective is to identify semantic correspondences between elements of two relational schemas using only names and descriptions.
arXiv Detail & Related papers (2024-07-16T15:33:00Z) - UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z) - Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu)
DAQu augments the original query with various (query-related) metadata across multiple tables.
We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z) - CARTE: Pretraining and Transfer for Tabular Learning [10.155109224816334]
We propose a neural architecture that does not need column correspondences across tables.
As a result, we can pretrain it on background data that has not been matched.
A benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines.
arXiv Detail & Related papers (2024-02-26T18:00:29Z) - Matching of Descriptive Labels to Glossary Descriptions [4.030805205247758]
We propose a framework to leverage an existing semantic text similarity measurement (STS) and augment it using semantic label enrichment and set-based collective contextualization.
We performed an experiment on two datasets derived from publicly available data sources.
arXiv Detail & Related papers (2023-10-27T07:09:04Z) - NameGuess: Column Name Expansion for Tabular Data [28.557115822407294]
We introduce a new task, called NameGuess, to expand column names as a natural language generation problem.
We create a training dataset of 384K abbreviated-expanded column pairs.
We enhance auto-regressive language models by conditioning on table content and column header names.
arXiv Detail & Related papers (2023-10-19T23:11:37Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z) - MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.