Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
- URL: http://arxiv.org/abs/2508.07179v1
- Date: Sun, 10 Aug 2025 05:04:32 GMT
- Title: Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
- Authors: Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu
- Abstract summary: "Semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL. This paper proposes a novel framework for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. Result: a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting.
- Score: 3.3705400036304205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, from 1.3B to 32B small language models (SLMs) to large language models (LLMs) like GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.
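The four lineage components described in the abstract can be pictured as a simple record type, together with a toy composite score in the spirit of SLiCE. This is a minimal sketch: the field names, equal weights, and exact-match scoring below are illustrative assumptions, not the paper's actual schema or metric definition (SLiCE additionally measures semantic fidelity, which exact matching does not capture).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchemaLineage:
    """One extracted lineage record; field names are illustrative assumptions."""
    source_schema: str            # e.g. "sales.orders(order_id INT, amount FLOAT)"
    source_tables: list           # tables the output column is derived from
    transformation: str           # e.g. "amount * (1 - discount)"
    aggregation: Optional[str]    # e.g. "SUM", or None if no aggregation

def composite_score(pred: SchemaLineage, gold: SchemaLineage,
                    weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Toy composite metric: per-component exact match, weighted average.

    Checks structural correctness only; a SLiCE-style metric would also
    score semantic equivalence of the transformation logic.
    """
    parts = [
        pred.source_schema == gold.source_schema,
        set(pred.source_tables) == set(gold.source_tables),
        pred.transformation == gold.transformation,
        pred.aggregation == gold.aggregation,
    ]
    return sum(w * int(p) for w, p in zip(weights, parts))
```

With equal weights, a prediction that matches the gold lineage on three of four components scores 0.75; a full match scores 1.0.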
Related papers
- LLMStructBench: Benchmarking Large Language Model Structured Data Extraction [1.338174941551702]
We present a novel benchmark for evaluating Large Language Models (LLMs). Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity. We show that choosing the right prompting strategy is more important than standard attributes such as model size.
arXiv Detail & Related papers (2026-02-16T13:37:58Z) - CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning [67.18702329644526]
CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure. We restructure the training data to enforce a new output form, providing new annotations for existing datasets. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
arXiv Detail & Related papers (2025-10-03T08:50:21Z) - Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider [2.1178416840822027]
This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model's architecture.
arXiv Detail & Related papers (2025-08-06T16:49:13Z) - Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z) - The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats [0.0]
This study systematically evaluates Large Language Models' ability to convert unstructured text into structured formats. Experiments reveal that GPT-4o with few-shot prompting achieves breakthrough performance. These findings open new possibilities for automated structured data generation across various domains.
arXiv Detail & Related papers (2025-03-04T14:14:28Z) - Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z) - Generation of Asset Administration Shell with Large Language Model Agents: Toward Semantic Interoperability in Digital Twins in the Context of Industry 4.0 [0.6749750044497732]
This research introduces a novel approach for achieving semantic interoperability in digital twins.
It assists the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0.
arXiv Detail & Related papers (2024-03-25T21:37:30Z) - Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on the Poincaré distance metric.
Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences.
Our framework sets new state-of-the-art performance on three benchmarks.
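The Poincaré distance mentioned above has a standard closed form for two points inside the unit ball. The sketch below implements that textbook formula only; how Proton applies it to probe schema-linking relations in a pre-trained model is beyond this snippet.

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball
    under the Poincaré ball model of hyperbolic space:

        d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm = lambda x: sum(xi * xi for xi in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

A useful sanity check: for a point at radius r from the origin, the distance reduces to 2*artanh(r), so distances blow up as points approach the boundary of the ball.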
arXiv Detail & Related papers (2022-06-28T14:05:25Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z) - Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data.
Based on experimental results, neural semantic parsers that leverage the GAP model obtain new state-of-the-art results on both the Spider and Criteria-to-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.