LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation
- URL: http://arxiv.org/abs/2602.08793v1
- Date: Mon, 09 Feb 2026 15:30:07 GMT
- Title: LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation
- Authors: Yushi Sun, Xujia Li, Nan Tang, Quanqing Xu, Chuanhui Yang, Lei Chen
- Abstract summary: Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions.
- Score: 18.72484471043965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether we can adapt an existing pre-trained LM-based model to a new (i.e., target) data lake to minimize the annotations required on the new data lake. However, challenges remain: the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.
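As a rough illustration of the cluster-based selection idea, the sketch below groups a target lake's unannotated columns by embedding and spends the annotation budget on cluster representatives; the embedding source, clusterer, and nearest-to-centroid heuristic are assumptions for illustration, not LakeHopper's actual implementation.

```python
# Illustrative sketch of cluster-based column selection for annotation;
# not LakeHopper's code. Assumes columns were already embedded, e.g. by
# an LM encoding of serialized values ("header: v1, v2, ...").
import numpy as np
from sklearn.cluster import KMeans

def select_columns_for_annotation(column_embeddings: np.ndarray,
                                  budget: int,
                                  n_clusters: int = 10,
                                  seed: int = 0) -> list[int]:
    """Pick `budget` column indices spread across embedding clusters."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto")
    labels = km.fit_predict(column_embeddings)
    selected: list[int] = []
    per_cluster = max(1, budget // n_clusters)
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        # Columns closest to the centroid are taken as representatives.
        dists = np.linalg.norm(
            column_embeddings[members] - km.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:per_cluster]].tolist())
    return selected[:budget]
```

The annotated representatives would then feed the incremental fine-tuning stage that gradually adapts the source model to the target lake.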
Related papers
- LakeMLB: Data Lake Machine Learning Benchmark [15.634664259138157]
We present LakeMLB (Data Lake Machine Learning Benchmark), designed for the most common multi-source, multi-table scenarios in data lakes. LakeMLB focuses on two representative multi-table scenarios, Union and Join, and provides three real-world datasets for each scenario, covering government open data, finance, Wikipedia, and online marketplaces.
arXiv Detail & Related papers (2026-02-11T02:33:29Z)
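To make the two scenarios concrete, here is a toy pandas sketch of Union (stacking schema-compatible tables) versus Join (enriching a base table with a joinable one); all table and column names are invented.

```python
# Toy illustration of the two multi-table scenarios; names are made up.
import pandas as pd

sales_2023 = pd.DataFrame({"product_id": [1, 2], "revenue": [10.0, 20.0]})
sales_2024 = pd.DataFrame({"product_id": [1, 3], "revenue": [12.0, 7.0]})
products = pd.DataFrame({"product_id": [1, 2, 3],
                         "category": ["a", "b", "a"]})

# Union: stack schema-compatible tables into one larger training table.
union_table = pd.concat([sales_2023, sales_2024], ignore_index=True)

# Join: enrich a base table with features from a joinable table.
join_table = sales_2024.merge(products, on="product_id", how="left")
```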
- OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [61.15402517835137]
We build a supervised fine-tuning (SFT) dataset to achieve state-of-the-art coding capability results in models of various sizes. Our models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning.
arXiv Detail & Related papers (2025-04-02T17:50:31Z)
- Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
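A minimal sketch of that cluster-then-subsample shape, assuming per-sample gradient features (e.g. low-dimensional projections) and per-sample importance scores from each cluster's chosen selector expert are already computed; all names are illustrative.

```python
# Hedged sketch of pseudo-skill clustering over gradient features; the
# clusterer and scoring interface are illustrative, not Adapt-Infinity's.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_subsample(grad_features: np.ndarray,
                          importance: np.ndarray,
                          n_skills: int,
                          keep_per_skill: int,
                          seed: int = 0) -> list[int]:
    """Group samples into pseudo-skill clusters; keep the top samples of
    each cluster according to the importance its selector assigned."""
    labels = KMeans(n_clusters=n_skills, random_state=seed,
                    n_init="auto").fit_predict(grad_features)
    kept: list[int] = []
    for c in range(n_skills):
        members = np.flatnonzero(labels == c)
        order = members[np.argsort(-importance[members])]
        kept.extend(order[:keep_per_skill].tolist())
    return kept
```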
- LLM-assisted Labeling Function Generation for Semantic Type Detection [5.938962712331031]
We propose using weak supervision to assist in annotating the training data for semantic type detection by leveraging labeling functions.
One challenge in this process is the difficulty of manually writing labeling functions due to the large volume and low quality of the data lake table datasets.
arXiv Detail & Related papers (2024-08-28T23:39:50Z)
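For context, a labeling function here maps a column to a semantic type or abstains. The hand-written stand-ins below show the shape of what the LLM would be asked to generate; the thresholds and type names are illustrative, not from the paper.

```python
# Illustrative Snorkel-style labeling functions for semantic column
# types; hand-written stand-ins for LLM-generated functions.
import re

EMAIL, PHONE, ABSTAIN = "email", "phone", None
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def lf_email(values: list[str]) -> str | None:
    hits = sum(bool(EMAIL_RE.match(v)) for v in values)
    return EMAIL if hits / max(len(values), 1) > 0.8 else ABSTAIN

def lf_phone(values: list[str]) -> str | None:
    digits = [re.sub(r"\D", "", v) for v in values]
    hits = sum(7 <= len(d) <= 15 for d in digits)
    return PHONE if hits / max(len(values), 1) > 0.8 else ABSTAIN

# Votes from many such functions are then denoised by a label model
# (weak supervision) into training labels for the type detector.
print(lf_email(["a@x.com", "b@y.org", "c@z.net"]))  # -> email
```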
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
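A hedged sketch of the activation-based clustering step, using a small public encoder through Hugging Face transformers; the model, layer, and mean-pooling choices are assumptions, not COINCIDE's exact setup.

```python
# Cluster samples by a small model's internal activations (sketch).
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

name = "prajjwal1/bert-tiny"  # any small encoder works for the sketch
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

def activation_vector(text: str, layer: int = -2) -> np.ndarray:
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt", truncation=True))
    # Mean-pool one intermediate layer as the sample's concept signature.
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

texts = ["describe the chart", "add 17 and 25", "caption this photo"]
feats = np.stack([activation_vector(t) for t in texts])
print(KMeans(n_clusters=2, n_init="auto",
             random_state=0).fit_predict(feats))  # concept-skill groups
```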
- Dated Data: Tracing Knowledge Cutoffs in Large Language Models [47.987664966633865]
We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM.
We find that effective cutoffs often differ from reported cutoffs.
Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps, and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates.
arXiv Detail & Related papers (2024-03-19T17:57:58Z)
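The shape of such an estimate, assuming per-month perplexities over time-versioned documents (e.g. Wikipedia snapshots) are already computed; the paper's actual alignment measure may differ.

```python
# Hedged sketch: take the month whose document versions the model fits
# best (lowest perplexity) as the resource's effective cutoff.
def effective_cutoff(ppl_by_month: dict[str, float]) -> str:
    """ppl_by_month maps 'YYYY-MM' -> perplexity on that month's docs."""
    return min(ppl_by_month, key=ppl_by_month.get)

print(effective_cutoff({"2022-12": 14.1, "2023-03": 12.6, "2023-06": 13.9}))
# -> 2023-03, even if the reported cutoff is later
```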
- Retrieve, Merge, Predict: Augmenting Tables with Data Lakes [7.449868392714658]
We present an in-depth analysis of automated table augmentation for machine learning tasks. We analyze different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake).
arXiv Detail & Related papers (2024-02-09T09:48:38Z)
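A skeleton of the three-step loop; retrieval is stubbed to a shared-key check (real systems use join-discovery indexes over the lake), and all names are illustrative.

```python
# Sketch of retrieve -> merge -> predict over a toy in-memory lake.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def retrieve_joinable(lake: list[pd.DataFrame], key: str):
    # Step 1 (stub): keep lake tables that share the join key.
    return [t for t in lake if key in t.columns]

def augment_and_predict(base: pd.DataFrame, lake: list[pd.DataFrame],
                        key: str, target: str):
    table = base
    for cand in retrieve_joinable(lake, key):          # retrieve
        table = table.merge(cand, on=key, how="left")  # merge
    X = (table.drop(columns=[target])                  # predict
              .select_dtypes("number").fillna(0))
    return GradientBoostingRegressor().fit(X, table[target])
```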
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that provides the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
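In the spirit of the gradient-similarity search, a sketch that assumes low-rank per-example gradient features are precomputed for the training pool and a few target-task examples; LESS's exact featurization differs.

```python
# Select the training examples most gradient-similar to the target task.
import numpy as np

def select_top_fraction(train_feats: np.ndarray,
                        target_feats: np.ndarray,
                        fraction: float = 0.05) -> np.ndarray:
    """Score each training example by its max cosine similarity to any
    target example over low-rank gradient features; keep the top slice."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    score = (unit(train_feats) @ unit(target_feats).T).max(axis=1)
    k = max(1, int(fraction * len(train_feats)))
    return np.argsort(-score)[:k]
```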
- Deep Lake: a Lakehouse for Deep Learning [0.0]
This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop.
arXiv Detail & Related papers (2022-09-22T05:04:09Z)
- Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation [102.67010690592011]
Unsupervised domain adaptation (UDA) aims to leverage the knowledge learned from a labeled source dataset to solve similar tasks in a new unlabeled domain.
Prior UDA methods typically require access to the source data when learning to adapt the model.
This work tackles a practical setting where only a trained source model is available and asks how we can effectively utilize such a model without source data to solve UDA problems.
arXiv Detail & Related papers (2020-02-20T03:13:58Z)
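A sketch of the information-maximization idea commonly used in this source-free setting: keep the source classifier head frozen and adapt the feature extractor so target predictions become confident yet class-diverse; the paper's pseudo-labeling refinements are omitted.

```python
# Information-maximization loss for source-free adaptation (sketch).
import torch
import torch.nn.functional as F

def info_max_loss(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    p = F.softmax(logits, dim=1)
    ent = -(p * (p + eps).log()).sum(dim=1).mean()   # confidence term
    marginal = p.mean(dim=0)
    div = (marginal * (marginal + eps).log()).sum()  # diversity term
    return ent + div

# Adaptation step (only the feature extractor receives gradients):
# for x in target_loader:
#     loss = info_max_loss(frozen_head(feature_extractor(x)))
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```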