Schema Extraction on Semi-structured Data
- URL: http://arxiv.org/abs/2012.08105v1
- Date: Tue, 15 Dec 2020 05:57:41 GMT
- Title: Schema Extraction on Semi-structured Data
- Authors: Panpan Li, Yikun Gong, Chen Wang
- Abstract summary: This survey investigates structural methods based on trees and graphs, and statistical methods based on distributed architectures and machine learning, for extracting schemas.
Schema extraction tools are mainly used with Spark or NoSQL databases and are suitable for small datasets or simple application environments.
Systems focus on the extraction and management of schemas in large datasets and complex application scenarios.
- Score: 3.09315460664784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the continuous development of NoSQL databases, more and more developers
choose to use semi-structured data for development and data management, which
puts forward requirements for schema management of semi-structured data stored
in NoSQL databases. Schema extraction plays an important role in understanding
schemas, optimizing queries, and validating data consistency. Therefore, in
this survey we investigate structural methods based on tree and graph and
statistical methods based on distributed architecture and machine learning to
extract schemas. The schemas obtained by the structural methods are more
interpretable, and the statistical methods have better applicability and
generalization ability. Moreover, we also investigate tools and systems for
schema extraction. Schema extraction tools are mainly used for Spark or NoSQL
databases, and are suitable for small datasets or simple application
environments. The system mainly focuses on the extraction and management of
schemas in large data sets and complex application scenarios. Furthermore, we
also compare these techniques to facilitate data managers' choice.
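
To make the structural approach mentioned in the abstract concrete, the following is a minimal sketch, not the survey's own method: it walks each JSON-like document as a tree, records the types seen at every field path, and merges the results across documents. The function names (`type_name`, `collect_paths`, `extract_schema`) and the path notation (`$` for the root, `[]` for array items) are assumptions made for illustration.

```python
from collections import defaultdict


def type_name(value):
    """Map a Python value to a JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def collect_paths(value, path, found):
    """Record the type seen at `path`, then recurse into child nodes."""
    found[path].add(type_name(value))
    if isinstance(value, dict):
        for key, child in value.items():
            collect_paths(child, f"{path}.{key}", found)
    elif isinstance(value, list):
        for item in value:
            collect_paths(item, f"{path}[]", found)


def extract_schema(documents):
    """Merge per-document path/type sets into a single schema.

    `documents` is a list of parsed JSON values. The result maps each
    path to the sorted type names observed there and to a `required`
    flag that is True only if the path occurs in every document.
    """
    type_sets = defaultdict(set)
    doc_counts = defaultdict(int)
    for doc in documents:
        found = defaultdict(set)
        collect_paths(doc, "$", found)
        for path, types in found.items():
            type_sets[path] |= types
            doc_counts[path] += 1
    total = len(documents)
    return {
        path: {
            "types": sorted(type_sets[path]),
            "required": doc_counts[path] == total,
        }
        for path in sorted(type_sets)
    }


if __name__ == "__main__":
    docs = [
        {"id": 1, "name": "a", "tags": ["x"]},
        {"id": "2", "name": "b", "address": {"city": "Beijing"}},
    ]
    for path, info in extract_schema(docs).items():
        print(path, info)
```

Because every document is held in memory and visited once, a sketch like this fits the small-dataset setting that the abstract associates with extraction tools. The large-scale setting it associates with systems would need a distributed or incremental merge instead.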
Related papers
- A Pre-training Framework for Relational Data with Information-theoretic Principles [57.93973948947743]
We introduce Task Vector Estimation (TVE), a novel pre-training framework that constructs supervisory signals via set-based aggregation over relational graphs.
TVE consistently outperforms traditional pre-training baselines.
Our findings advocate for pre-training objectives that encode task heterogeneity and temporal structure as design principles for predictive modeling on relational databases.
arXiv Detail & Related papers (2025-07-14T00:17:21Z) - Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures [50.46688111973999]
Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data.
We present a new blueprint that enables end-to-end representation of 'relational entity graphs' without traditional feature engineering.
We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data.
arXiv Detail & Related papers (2025-06-19T23:51:38Z) - Schema as Parameterized Tools for Universal Information Extraction [27.4621163733051]
Universal information extraction (UIE) primarily employs an extractive generation approach with large language models (LLMs).
We propose a unified adaptive text-to-structure generation framework, called Schema as Parameterized Tools (SPT).
arXiv Detail & Related papers (2025-06-02T03:12:44Z) - SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases [1.6544167074080365]
We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations.
We apply classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined.
Our method achieves state-of-the-art results on the BIRD benchmark, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches.
arXiv Detail & Related papers (2025-05-23T20:42:36Z) - Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB [44.057784044659726]
Large language models (LLMs) have made it easier to prototype such retrieval and reasoning data pipelines.
This often involves orchestrating data systems, managing data movement, and handling low-level details.
We introduce FlockMTL: an extension for abstractions that deeply integrates LLM capabilities and retrieval-augmented generation.
arXiv Detail & Related papers (2025-04-01T19:48:17Z) - SchemaAgent: A Multi-Agents Framework for Generating Relational Database Schema [35.57815867567431]
Existing efforts are mostly based on customized rules or conventional deep learning models, often producing relational schema.
We propose SchemaAgent, a unified LLM-based multi-agent framework for the automated generation of high-quality database schema.
We incorporate dedicated roles for reflection and inspection, alongside an innovative error detection and correction mechanism to identify and rectify issues across various phases.
arXiv Detail & Related papers (2025-03-31T09:39:19Z) - Space of Data through the Lens of Multilevel Graph [0.0]
This work seeks to tackle the inherent complexity of dataspaces by introducing a novel data structure.
We propose the concept of a multilevel graph, which is equipped with two fundamental operations: contraction and expansion of its topology.
We provide a comprehensive suite of methods for manipulating this graph structure, establishing a robust framework for data analysis.
arXiv Detail & Related papers (2025-03-30T21:54:07Z) - Towards Agentic Schema Refinement [3.7173623393215287]
We propose a semantic layer in-between the database and the user as a set of small and easy-to-interpret database views.
Our approach paves the way for LLM-powered exploration of unwieldy databases.
arXiv Detail & Related papers (2024-11-25T19:57:16Z) - Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z) - Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study [4.742245127121496]
Structured-GraphRAG is a versatile framework designed to enhance information retrieval across structured datasets in natural language queries.
Our findings show that Structured-GraphRAG significantly improves query processing efficiency and reduces response times.
arXiv Detail & Related papers (2024-09-26T06:53:29Z) - Exploiting Formal Concept Analysis for Data Modeling in Data Lakes [0.29998889086656577]
This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA).
We represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema.
We achieve a complete coverage of 80 percent of data structures with only 34 distinct field names.
arXiv Detail & Related papers (2024-08-11T13:58:31Z) - Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z) - Powering In-Database Dynamic Model Slicing for Structured Data Analytics [31.360239181279525]
We introduce LEADS, a novel dynamic model slicing technique to customize models for specified SQL queries.
LEADS improves the predictive modeling of structured data via the mixture of experts (MoE) and maintains efficiency by a SQL-aware gating network.
Our experiments on real-world datasets demonstrate that LEADS consistently outperforms the baseline models.
arXiv Detail & Related papers (2024-05-01T15:18:12Z) - Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z) - Optimization Techniques for Unsupervised Complex Table Reasoning via Self-Training Framework [5.351873055148804]
Self-training framework generates diverse synthetic data with complex logic.
We optimize the procedure using a "Table-Text Manipulator" to handle joint table-text reasoning scenarios.
UCTRST achieves above 90% of the supervised model performance on different tasks and domains.
arXiv Detail & Related papers (2022-12-20T09:15:03Z) - Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on the Poincaré distance metric.
Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences.
Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z) - CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z) - Procedures as Programs: Hierarchical Control of Situated Agents through Natural Language [81.73820295186727]
We propose a formalism of procedures as programs, a powerful yet intuitive method of representing hierarchical procedural knowledge for agent command and control.
We instantiate this framework on the IQA and ALFRED datasets for NL instruction following.
arXiv Detail & Related papers (2021-09-16T20:36:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.