NameGuess: Column Name Expansion for Tabular Data
- URL: http://arxiv.org/abs/2310.13196v1
- Date: Thu, 19 Oct 2023 23:11:37 GMT
- Title: NameGuess: Column Name Expansion for Tabular Data
- Authors: Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Shen Wang,
Huzefa Rangwala, George Karypis
- Abstract summary: We introduce a new task, called NameGuess, to expand column names as a natural language generation problem.
We create a training dataset of 384K abbreviated-expanded column pairs.
We enhance auto-regressive language models by conditioning on table content and column header names.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models have revolutionized many sectors,
including the database industry. One common challenge when dealing with large
volumes of tabular data is the pervasive use of abbreviated column names, which
can negatively impact performance on various data search, access, and
understanding tasks. To address this issue, we introduce a new task, called
NameGuess, which frames the expansion of column names (as used in database
schemas) as a natural language generation problem. We create a training dataset of 384K
abbreviated-expanded column pairs using a new data fabrication method and a
human-annotated evaluation benchmark that includes 9.2K examples from
real-world tables. To tackle the complexities associated with polysemy and
ambiguity in NameGuess, we enhance auto-regressive language models by
conditioning on table content and column header names -- yielding a fine-tuned
model (with 2.7B parameters) that matches human performance. Furthermore, we
conduct a comprehensive analysis (on multiple LLMs) to validate the
effectiveness of table content in NameGuess and identify promising future
opportunities. Code has been made available at
https://github.com/amazon-science/nameguess.
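The abstract describes conditioning a language model on both the abbreviated headers and sample table content. As a minimal illustration, the sketch below shows one plausible way to serialize a table into such a prompt; the function name, template wording, and example columns are hypothetical and may differ from the prompt format in the paper's released code.

```python
# Hedged sketch: serialize abbreviated column headers plus a few sample rows
# into a single prompt string, so a language model can use table content as
# context when expanding the names. Illustrative only, not the paper's exact template.

def build_expansion_prompt(headers, rows, max_rows=3):
    """Build a column-name-expansion prompt from headers and sample rows."""
    lines = [" | ".join(headers)]
    for row in rows[:max_rows]:  # a small content sample is usually enough context
        lines.append(" | ".join(str(v) for v in row))
    table_text = "\n".join(lines)
    return (
        "Expand each abbreviated column name into a full, "
        "human-readable name, using the table content as context.\n\n"
        f"{table_text}\n\n"
        "Expanded names (comma-separated):"
    )

# Hypothetical table with abbreviated headers.
prompt = build_expansion_prompt(
    headers=["cust_nm", "acct_bal", "txn_dt"],
    rows=[["J. Smith", 1042.50, "2023-07-01"],
          ["A. Chen", 88.00, "2023-07-02"]],
)
print(prompt)
```

The completion for this prompt would ideally read something like "customer name, account balance, transaction date"; resolving polysemous abbreviations (e.g. whether `dt` means "date" or "datetime") is exactly where the sampled rows help.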
Related papers
- Enhancing Text-to-SQL Capabilities of Large Language Models via Domain Database Knowledge Injection [23.423794784621368]
Large Language Models (LLMs) face challenges due to schema issues and a lack of domain-specific database knowledge.
This paper introduces a method of knowledge injection to enhance LLMs' ability to understand contents by incorporating prior knowledge.
arXiv Detail & Related papers (2024-09-24T09:24:03Z)
- WikiTableEdit: A Benchmark for Table Editing by Natural Language Instruction [56.196512595940334]
This paper investigates the performance of Large Language Models (LLMs) in the context of table editing tasks.
We leverage 26,531 tables from the Wiki dataset to generate natural language instructions for six distinct basic operations.
We evaluate several representative large language models on the WikiTableEdit dataset to demonstrate the challenge of this task.
arXiv Detail & Related papers (2024-03-05T13:33:12Z)
- CARTE: Pretraining and Transfer for Tabular Learning [10.155109224816334]
We propose a neural architecture for tabular learning that does not require explicit correspondences between the columns of different tables.
As a result, we can pretrain it on background data that has not been matched to the target schema.
A benchmark shows that CARTE facilitates learning and outperforms a solid set of baselines.
arXiv Detail & Related papers (2024-02-26T18:00:29Z)
- Matching Table Metadata with Business Glossaries Using Large Language Models [18.1687301652456]
We study the problem of matching table metadata to a business glossary containing data labels and descriptions.
The resulting matching enables the use of an available or curated business glossary for retrieval and analysis without or before requesting access to the data contents.
We leverage the power of large language models (LLMs) to design generic matching methods that do not require manual tuning.
arXiv Detail & Related papers (2023-09-08T02:23:59Z)
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Learning to Name Classes for Vision and Language Models [57.0059455405424]
Large scale vision and language models can achieve impressive zero-shot recognition performance by mapping class specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names.
arXiv Detail & Related papers (2023-04-04T14:34:44Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks [0.2578242050187029]
We propose a novel deep learning based supervised approach that utilizes POS tags of NLQs.
We evaluate our approach on eight different datasets and report new state-of-the-art accuracy results of 92.4% on average.
arXiv Detail & Related papers (2021-01-11T22:54:39Z)
- Semantic Labeling Using a Deep Contextualized Language Model [9.719972529205101]
We propose a context-aware semantic labeling method using both the column values and context.
Our new method is based on a new setting for semantic labeling, where we sequentially predict labels for an input table with missing headers.
To our knowledge, we are the first to successfully apply BERT to solve the semantic labeling task.
arXiv Detail & Related papers (2020-10-30T03:04:22Z)
- Empower Entity Set Expansion via Language Model Probing [58.78909391545238]
Existing set expansion methods bootstrap the seed entity set by adaptively selecting context features and extracting new entities.
A key challenge for entity set expansion is to avoid selecting ambiguous context features which will shift the class semantics and lead to accumulative errors in later iterations.
We propose a novel iterative set expansion framework that leverages automatically generated class names to address the semantic drift issue.
arXiv Detail & Related papers (2020-04-29T00:09:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.