A Big Data Lake for Multilevel Streaming Analytics
- URL: http://arxiv.org/abs/2009.12415v1
- Date: Fri, 25 Sep 2020 19:57:21 GMT
- Title: A Big Data Lake for Multilevel Streaming Analytics
- Authors: Ruoran Liu, Haruna Isah, Farhana Zulkernine
- Abstract summary: This paper focuses on storing high volume, velocity and variety data in the raw formats in a data storage architecture called a data lake.
We discuss and compare different open source and commercial platforms that can be used to develop a data lake.
We present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics.
- Score: 0.4640835690336652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large organizations are seeking to create new architectures and scalable
platforms to effectively handle data management challenges due to the explosive
nature of data rarely seen in the past. These data management challenges are
largely posed by the availability of streaming data at high velocity from
various sources in multiple formats. The changes in data paradigm have led to
the emergence of new data analytics and management architecture. This paper
focuses on storing high volume, velocity and variety data in the raw formats in
a data storage architecture called a data lake. First, we present our study on
the limitations of traditional data warehouses in handling recent changes in
data paradigms. We discuss and compare different open source and commercial
platforms that can be used to develop a data lake. We then describe our
end-to-end data lake design and implementation approach using the Hadoop
Distributed File System (HDFS) on the Hadoop Data Platform (HDP). Finally, we
present a real-world data lake development use case for data stream ingestion,
staging, and multilevel streaming analytics which combines structured and
unstructured data. This study can serve as a guide for individuals or
organizations planning to implement a data lake solution for their use cases.
Related papers
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie [5.259526087073711]
We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and to offer full pipeline reproducibility with a few CLI commands.
arXiv Detail & Related papers (2024-04-21T14:53:33Z)
- Empowering Data Mesh with Federated Learning [5.087058648342379]
A new paradigm, Data Mesh, treats domains as a first-class concern by distributing data ownership from the central team to each data domain.
Many multi-million dollar organizations like Paypal, Netflix, and Zalando have already transformed their data analysis pipelines based on this new architecture.
We introduce a pioneering approach that incorporates Federated Learning into Data Mesh.
arXiv Detail & Related papers (2024-03-26T17:10:15Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Federated Neural Graph Databases [53.03085605769093]
We propose Federated Neural Graph Database (FedNGDB), a novel framework that enables reasoning over multi-source graph-based data while preserving privacy.
Unlike existing methods, FedNGDB can handle complex graph structures and relationships, making it suitable for various downstream tasks.
arXiv Detail & Related papers (2024-02-22T14:57:44Z)
- Semantic Data Management in Data Lakes [0.0]
In recent years, data lakes have emerged as a way to manage large amounts of heterogeneous data for modern data analytics.
One way to prevent data lakes from turning into inoperable data swamps is semantic data management.
We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontology-based data access.
arXiv Detail & Related papers (2023-10-23T21:16:50Z)
- Data Architecture for Digital Object Space Management Service (DOSM) using DAT [1.8945921149936187]
This work focuses on describing the movement of data, data formats, data location, data processing (batch or real-time), data storage technologies, and main operations on the data.
Data architecture is a complex task that involves describing the flow of data from its source to its destination.
arXiv Detail & Related papers (2023-06-22T14:22:56Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
It has become challenging to handle the unlimited growth of data with limited computing power.
Deep learning technology has developed unprecedentedly in the last decade.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- Deep Lake: a Lakehouse for Deep Learning [0.0]
Deep Lake is an open-source lakehouse for deep learning applications developed at Activeloop.
arXiv Detail & Related papers (2022-09-22T05:04:09Z)
- DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.