Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
- URL: http://arxiv.org/abs/2508.07179v1
- Date: Sun, 10 Aug 2025 05:04:32 GMT
- Title: Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
- Authors: Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu
- Abstract summary: "Semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL. This paper proposes a novel framework for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. Result: a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting.
- Score: 3.3705400036304205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, from 1.3B to 32B small language models (SLMs) to large language models (LLMs) like GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.
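The four lineage components described in the abstract can be pictured as a simple record type, together with a toy composite score in the spirit of SLiCE. This is a minimal sketch: the field names, equal weights, and exact-match scoring below are illustrative assumptions, not the paper's actual schema or metric definition (SLiCE additionally measures semantic fidelity, which exact matching does not capture).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchemaLineage:
    """One extracted lineage record; field names are illustrative assumptions."""
    source_schema: str            # e.g. "sales.orders(order_id INT, amount FLOAT)"
    source_tables: list           # tables the output column is derived from
    transformation: str           # e.g. "amount * (1 - discount)"
    aggregation: Optional[str]    # e.g. "SUM", or None if no aggregation

def composite_score(pred: SchemaLineage, gold: SchemaLineage,
                    weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Toy composite metric: per-component exact match, weighted average.

    Checks structural correctness only; a SLiCE-style metric would also
    score semantic equivalence of the transformation logic.
    """
    parts = [
        pred.source_schema == gold.source_schema,
        set(pred.source_tables) == set(gold.source_tables),
        pred.transformation == gold.transformation,
        pred.aggregation == gold.aggregation,
    ]
    return sum(w * int(p) for w, p in zip(weights, parts))
```

With equal weights, a prediction that matches the gold lineage on three of four components scores 0.75; a full match scores 1.0.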
Related papers
- LLMStructBench: Benchmarking Large Language Model Structured Data Extraction [1.338174941551702]
We present a novel benchmark for evaluating Large Language Models (LLMs). Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity. We show that choosing the right prompting strategy is more important than standard attributes such as model size.
arXiv Detail & Related papers (2026-02-16T13:37:58Z) - CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning [67.18702329644526]
CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure. We restructure the training data to enforce a new output form, providing new annotations for existing datasets. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
arXiv Detail & Related papers (2025-10-03T08:50:21Z) - Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider [2.1178416840822027]
This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model's architecture.
arXiv Detail & Related papers (2025-08-06T16:49:13Z) - Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z) - The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats [0.0]
This study systematically evaluates Large Language Models' ability to convert unstructured text into structured formats. Experiments reveal that GPT-4o with few-shot prompting achieves breakthrough performance. These findings open new possibilities for automated structured data generation across various domains.
arXiv Detail & Related papers (2025-03-04T14:14:28Z) - Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z) - Generation of Asset Administration Shell with Large Language Model Agents: Toward Semantic Interoperability in Digital Twins in the Context of Industry 4.0 [0.6749750044497732]
This research introduces a novel approach for achieving semantic interoperability in digital twins.
It assists the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0.
arXiv Detail & Related papers (2024-03-25T21:37:30Z) - Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on the Poincaré distance metric.
Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences.
Our framework sets new state-of-the-art performance on three benchmarks.
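The Poincaré distance mentioned above has a standard closed form for two points inside the unit ball. The sketch below implements that textbook formula only; how Proton applies it to probe schema-linking relations in a pre-trained model is beyond this snippet.

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball
    under the Poincaré ball model of hyperbolic space:

        d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm = lambda x: sum(xi * xi for xi in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

A useful sanity check: for a point at radius r from the origin, the distance reduces to 2*artanh(r), so distances blow up as points approach the boundary of the ball.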
arXiv Detail & Related papers (2022-06-28T14:05:25Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z) - Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data.
Based on experimental results, neural semantic parsers that leverage the GAP model obtain new state-of-the-art results on both the Spider and Criteria-to-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.