Related papers: Exploiting Formal Concept Analysis for Data Modeling in Data Lakes

URL: http://arxiv.org/abs/2408.13265v1
Date: Sun, 11 Aug 2024 13:58:31 GMT
Title: Exploiting Formal Concept Analysis for Data Modeling in Data Lakes
Authors: Anes Bendimerad, Romain Mathonat, Youcef Remil, Mehdi Kaytoue
Abstract summary: This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA). We represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema. We achieve complete coverage of 80 percent of data structures with only 34 distinct field names.
Score: 0.29998889086656577
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data lakes are widely used to store extensive and heterogeneous datasets for advanced analytics. However, the unstructured nature of data in these repositories makes it difficult to exploit them and extract meaningful insights. This motivates the need to explore efficient approaches for consolidating data lakes and deriving a common, unified schema. This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA) to systematically clean, organize, and design data structures within a data lake. We explore diverse data structures stored in our data lake at Infologic, including InfluxDB measurements and Elasticsearch indexes, aiming to derive conventions for a more accessible data model. Leveraging FCA, we represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema. Our methodology yields significant results, enabling the identification of common concepts in the data structures, such as resources along with their underlying shared fields (timestamp, type, usedRatio, etc.). Moreover, the number of distinct data structure field names is reduced by 54 percent (from 190 to 88) in the studied subset of our data lake. We achieve complete coverage of 80 percent of data structures with only 34 distinct field names, a significant improvement over the initial 121 field names that were needed to reach such coverage. The paper provides insights into the Infologic ecosystem, the problem formulation, and the exploration strategies, and presents both qualitative and quantitative results.
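
To make the FCA vocabulary in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The toy context, the structure names, and the helper names (`common_fields`, `structures_with`, `formal_concepts`, `coverage`) are all hypothetical. Only the field names timestamp, type, and usedRatio and the figures 190 and 88 come from the abstract. Each data structure is treated as an object whose attributes are its field names, and concepts are enumerated by brute force, which is only practical for very small contexts. The `coverage` function reflects one assumed reading of "complete coverage": a structure counts as covered when all of its fields belong to the chosen vocabulary.

```python
from itertools import combinations

# Hypothetical toy context: data structure name -> set of its field names.
context = {
    "cpu_measurement": {"timestamp", "type", "usedRatio", "host"},
    "disk_measurement": {"timestamp", "type", "usedRatio", "mountPoint"},
    "log_index": {"timestamp", "message", "host"},
}


def common_fields(structures):
    """Intent: the fields shared by every structure in `structures`."""
    if not structures:
        # By convention, the empty set of objects shares every attribute.
        return set().union(*context.values())
    return set.intersection(*(context[s] for s in structures))


def structures_with(fields):
    """Extent: the structures that contain every field in `fields`."""
    return {name for name, own in context.items() if fields <= own}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    names = list(context)
    concepts = set()
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = frozenset(common_fields(subset))
            extent = frozenset(structures_with(intent))
            concepts.add((extent, intent))
    return concepts


def coverage(vocabulary):
    """Fraction of structures whose fields all belong to `vocabulary` (assumed definition)."""
    covered = [name for name, own in context.items() if own <= vocabulary]
    return len(covered) / len(context)


concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), "share", sorted(intent))

# Relative reduction in distinct field names reported in the abstract (190 -> 88).
reduction_percent = round(100 * (190 - 88) / 190)
```

In this toy context, the concept whose extent is the two measurement structures has the intent {timestamp, type, usedRatio}. This is the kind of shared "resource" pattern the abstract describes, although the real lattice in the paper is built from the Infologic data lake rather than from this example.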

Related papers

LLM-Driven Ontology Construction for Enterprise Knowledge Graphs [0.0]
This paper introduces OntoEKG, a pipeline designed to accelerate the generation of domain-specific ontologies from unstructured enterprise data. Our approach decomposes the modelling task into two distinct phases: an extraction module that identifies core classes and properties, and an entailment module that logically structures these elements into a hierarchy before serialising them into standard RDF. Addressing the significant lack of comprehensive benchmarks for end-to-end construction, we adopt a new evaluation dataset derived from documents across the Data, Finance, and Logistics sectors.
arXiv Detail & Related papers (2026-02-01T15:13:30Z)
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem built on four key pillars, including (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models, (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes, and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
Space of Data through the Lens of Multilevel Graph [0.0]
This work seeks to tackle the inherent complexity of dataspaces by introducing a novel data structure. We propose the concept of a multilevel graph, which is equipped with two fundamental operations: contraction and expansion of its topology. We provide a comprehensive suite of methods for manipulating this graph structure, establishing a robust framework for data analysis.
arXiv Detail & Related papers (2025-03-30T21:54:07Z)
ClusterGraph: a new tool for visualization and compression of multidimensional data [0.0]
This paper provides an additional layer on the output of any clustering algorithm. It provides information about the global layout of clusters, obtained from the considered clustering algorithm.
arXiv Detail & Related papers (2024-11-08T09:40:54Z)
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains. BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study [4.742245127121496]
Structured-GraphRAG is a versatile framework designed to enhance information retrieval across structured datasets in natural language queries. Our findings show that Structured-GraphRAG significantly improves query processing efficiency and reduces response times.
arXiv Detail & Related papers (2024-09-26T06:53:29Z)
Big data searching using words [0.0]
We introduce some fundamental ideas related to the neighborhood structure of words in data searching. We also introduce big data primal in big data searching and discuss the application of neighborhood structures in detecting anomalies in data searching.
arXiv Detail & Related papers (2024-09-10T13:46:14Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [103.0865116794534]
We introduce large models into the data collection pipeline to guide the generation of domain-specific information. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines the data containing potential reasoning procedures from the public corpus.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
Cross Modal Data Discovery over Structured and Unstructured Data Lakes [5.270224494298927]
Organizations are collecting increasingly large amounts of data for data-driven decision making. These data are often dumped into a centralized repository consisting of thousands of structured and unstructured datasets. Perversely, such a mixture of datasets makes it very challenging to discover elements relevant to a user's query or an analytical task.
arXiv Detail & Related papers (2023-06-01T17:34:42Z)
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization. infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models. We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
Joint Geometric and Topological Analysis of Hierarchical Datasets [7.098759778181621]
In this paper, we focus on high-dimensional data that are organized into several hierarchical datasets. The main novelty in this work lies in the combination of two powerful data-analytic approaches: topological data analysis and geometric manifold learning. We show that our new method gives rise to superior classification results compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-04-03T13:02:00Z)
Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets. We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy. Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
Mapping Patterns for Virtual Knowledge Graphs [71.61234136161742]
Virtual Knowledge Graphs (VKG) constitute one of the most promising paradigms for integrating and accessing legacy data sources. We build on well-established methodologies and patterns studied in data management, data analysis, and conceptual modeling. We validate our catalog on the considered VKG scenarios, showing it covers the vast majority of patterns present therein.
arXiv Detail & Related papers (2020-12-03T13:54:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.