Cross Modal Data Discovery over Structured and Unstructured Data Lakes
- URL: http://arxiv.org/abs/2306.00932v3
- Date: Sun, 16 Jul 2023 12:24:50 GMT
- Title: Cross Modal Data Discovery over Structured and Unstructured Data Lakes
- Authors: Mohamed Y. Eltabakh, Mayuresh Kunjir, Ahmed Elmagarmid, Mohammad Shahmeer Ahmad
- Abstract summary: Organizations are collecting increasingly large amounts of data for data-driven decision making.
These data are often dumped into a centralized repository, consisting of thousands of structured and unstructured datasets.
Perversely, such a mixture of datasets makes the problem of discovering elements relevant to a user's query or an analytical task very challenging.
- Score: 5.270224494298927
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Organizations are collecting increasingly large amounts of data for
data-driven decision making. These data are often dumped into a centralized
repository, e.g., a data lake, consisting of thousands of structured and
unstructured datasets. Perversely, such a mixture of datasets makes the problem
of discovering elements (e.g., tables or documents) that are relevant to a
user's query or an analytical task very challenging. Despite recent efforts
in data discovery, the problem remains wide open, especially on two fronts:
(1) discovering relationships and relatedness across structured and
unstructured datasets, where existing techniques scale poorly, are
customized for a specific problem type (e.g., entity matching or data
integration), or destroy the structural properties of the data along the way;
and (2) developing a holistic system that integrates various similarity
measurements and sketches effectively to boost discovery accuracy. In this
paper, we propose a new data discovery system, named CMDL, to address these
two limitations. CMDL supports the data discovery process over both structured
and unstructured data while retaining the structural properties of tables.
Related papers
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study [4.742245127121496]
Structured-GraphRAG is a versatile framework designed to enhance information retrieval across structured datasets in natural language queries.
Our findings show that Structured-GraphRAG significantly improves query processing efficiency and reduces response times.
arXiv Detail & Related papers (2024-09-26T06:53:29Z)
- Exploiting Formal Concept Analysis for Data Modeling in Data Lakes [0.29998889086656577]
This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA).
We represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema.
We achieve a complete coverage of 80 percent of data structures with only 34 distinct field names.
arXiv Detail & Related papers (2024-08-11T13:58:31Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems [10.71630696651595]
Compound AI systems (CASs) that employ LLMs as agents to accomplish knowledge-intensive tasks have garnered significant interest within database and AI communities.
Silos of multimodal data sources make it difficult to identify appropriate data sources for accomplishing the task at hand.
We propose CMDBench, a benchmark modeling the complexity of enterprise data platforms.
arXiv Detail & Related papers (2024-06-02T01:10:41Z)
- Decoupled Subgraph Federated Learning [57.588938805581044]
We address the challenge of federated learning on graph-structured data distributed across multiple clients.
We present a novel framework for this scenario, named FedStruct, that harnesses deep structural dependencies.
We validate the effectiveness of FedStruct through experimental results conducted on six datasets for semi-supervised node classification.
arXiv Detail & Related papers (2024-02-29T13:47:23Z)
- Federated Causal Discovery [74.37739054932733]
This paper develops a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD).
It can learn the causal graph without directly touching local data and naturally handle the data heterogeneity.
Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
arXiv Detail & Related papers (2021-12-07T08:04:12Z)
- CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
- Graph integration of structured, semistructured and unstructured data for data journalism [4.508924138721326]
We describe a complete approach for integrating dynamic sets of heterogeneous datasets.
Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
arXiv Detail & Related papers (2020-12-16T09:59:27Z)
- Graph integration of structured, semistructured and unstructured data for data journalism [0.0]
We describe a complete approach for integrating dynamic sets of heterogeneous data sources.
Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
arXiv Detail & Related papers (2020-07-23T08:55:09Z)
- DART: Open-Domain Structured Data Record to Text Generation [91.23798751437835]
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs).
We propose a procedure of extracting semantic triples from tables that encode their structures by exploiting the semantic dependencies among table headers and the table title.
Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and dialogue-act-based meaning representation tasks.
arXiv Detail & Related papers (2020-07-06T16:35:30Z)