Deep Lake: a Lakehouse for Deep Learning
- URL: http://arxiv.org/abs/2209.10785v1
- Date: Thu, 22 Sep 2022 05:04:09 GMT
- Title: Deep Lake: a Lakehouse for Deep Learning
- Authors: Sasun Hambardzumyan, Abhinav Tuli, Levon Ghukasyan, Fariz Rahman,
Hrant Topchyan, David Isayan, Mikayel Harutyunyan, Tatevik Hakobyan, Ivo
Stranic, Davit Buniatyan
- Abstract summary: Deep Lake is an open-source lakehouse for deep learning applications developed at Activeloop.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional data lakes provide critical data infrastructure for analytical
workloads by enabling time travel, running SQL queries, ingesting data with
ACID transactions, and visualizing petabyte-scale datasets on cloud storage.
They allow organizations to break down data silos, unlock data-driven
decision-making, improve operational efficiency, and reduce costs. However, as
deep learning takes over common analytical workflows, traditional data lakes
become less useful for applications such as natural language processing (NLP),
audio processing, computer vision, and applications involving non-tabular
datasets. This paper presents Deep Lake, an open-source lakehouse for deep
learning applications developed at Activeloop. Deep Lake maintains the benefits
of a vanilla data lake with one key difference: it stores complex data, such as
images, videos, annotations, as well as tabular data, in the form of tensors
and rapidly streams the data over the network to (a) Tensor Query Language, (b)
in-browser visualization engine, or (c) deep learning frameworks without
sacrificing GPU utilization. Datasets stored in Deep Lake can be accessed from
PyTorch, TensorFlow, JAX, and integrate with numerous MLOps tools.
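A minimal sketch of the access pattern described above, assuming the `deeplake` Python package's v3-style API (`deeplake.load`, `ds.pytorch()`) and the public `hub://activeloop/cifar10-train` dataset; the tensor names and arguments follow typical examples and may need adjusting for other datasets.

```python
# Minimal sketch: stream a Deep Lake dataset straight into a PyTorch loop.
# Assumes the deeplake v3-style Python API and the public cifar10-train path;
# tensor names ("images", "labels") depend on the dataset schema.
import deeplake
import torch
from torchvision import transforms

ds = deeplake.load("hub://activeloop/cifar10-train")  # lazily opens the dataset on cloud storage

tform = transforms.Compose([
    transforms.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
])

# .pytorch() wraps the dataset in a streaming DataLoader; samples are fetched
# and decoded in background workers so the GPU is not left waiting on I/O.
loader = ds.pytorch(
    num_workers=2,
    batch_size=64,
    transform={"images": tform, "labels": None},
    shuffle=True,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    images = batch["images"].to(device)
    labels = batch["labels"].squeeze().to(device)
    # ... forward/backward pass of the model would go here ...
    break  # one batch is enough to show the access pattern
```

The prefetch-and-decode behavior of the streaming loader is, broadly, how a loader of this kind avoids starving the GPU while reading from remote object storage.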
Related papers
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
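A generic sketch of the datastore lookup behind retrieval-based LMs (not the paper's code): embed the query, find its nearest neighbours in an embedded passage datastore, and hand the retrieved passages to the LM. The embedding dimension, datastore contents, and `k` are illustrative assumptions.

```python
# Generic sketch (not the paper's implementation): nearest-neighbour lookup over
# an embedded text datastore, the core operation behind retrieval-based LMs.
import numpy as np

rng = np.random.default_rng(0)
datastore_vecs = rng.normal(size=(100_000, 384)).astype(np.float32)  # stand-in for embedded passages
datastore_vecs /= np.linalg.norm(datastore_vecs, axis=1, keepdims=True)
datastore_text = [f"passage {i}" for i in range(datastore_vecs.shape[0])]

def retrieve(query_vec: np.ndarray, k: int = 5) -> list[str]:
    """Return the k passages whose embeddings are closest (cosine) to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = datastore_vecs @ q
    top = np.argpartition(-scores, k)[:k]
    return [datastore_text[i] for i in top[np.argsort(-scores[top])]]

query = rng.normal(size=384).astype(np.float32)
context = retrieve(query, k=5)
# The retrieved passages would be prepended to the LM prompt (or interpolated
# with the LM's token distribution, as in kNN-LM) before generation.
```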
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie
We introduce a system designed to decouple compute from data management by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and to run a full pipeline with a few CLI commands.
arXiv Detail & Related papers (2024-04-21T14:53:33Z)
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
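As a loose illustration of what fitting a neural scaling law looks like in practice (the functional form and data points below are invented, not the paper's fitted law), one can fit a saturating power law of loss against dataset size and use it to extrapolate:

```python
# Generic illustration: fit a saturating power law L(n) = E + a * n**(-b)
# to (dataset size, validation loss) pairs, the basic form behind most
# neural scaling-law analyses. The measurements here are made up.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, E, a, b):
    return E + a * np.power(n, -b)

n_samples = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
val_loss  = np.array([3.10, 2.71, 2.42, 2.21, 2.07])  # hypothetical measurements

(E, a, b), _ = curve_fit(scaling_law, n_samples, val_loss, p0=(1.5, 50.0, 0.3))
print(f"irreducible loss ~ {E:.2f}, exponent b ~ {b:.3f}")

# Extrapolate: predicted loss if the curated dataset were 10x larger.
print(scaling_law(1e8, E, a, b))
```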
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table.
As data lakes, the paper uses YADL (Yet Another Data Lake) and Open Data US, a well-referenced real data lake.
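A toy sketch of that retrieve-merge-predict loop (tables, keys, and columns are invented for illustration; this is not the paper's benchmark code):

```python
# Generic retrieve -> merge -> predict sketch: augment a base table with a
# joinable table "retrieved" from a data lake, then fit a predictor on the
# merged features. All tables and columns are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Base table with the prediction target.
base = pd.DataFrame({
    "company": ["acme", "globex", "initech", "umbrella"],
    "revenue": [1.2, 3.4, 0.9, 5.1],  # target
})

# Candidate table found in the lake via a join-key match on "company".
candidate = pd.DataFrame({
    "company": ["acme", "globex", "initech", "umbrella"],
    "employees": [120, 900, 45, 2300],
    "founded": [1998, 2004, 2010, 1979],
})

# Merge step: left join on the shared key.
merged = base.merge(candidate, on="company", how="left")

# Predict step: fit a model on the augmented features (tiny toy data, so we
# simply fit and predict on the same table).
features = merged[["employees", "founded"]]
model = GradientBoostingRegressor().fit(features, merged["revenue"])
print(model.predict(features))
```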
arXiv Detail & Related papers (2024-02-09T09:48:38Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
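A minimal sketch of the underlying idea, with invented tables and a plain-Python graph container rather than the paper's framework: rows become nodes, foreign-key references become edges, and a message-passing GNN would then operate on the resulting graph.

```python
# Illustrative sketch (not the paper's library): turn two relational tables into
# a heterogeneous graph whose edges follow foreign keys. A message-passing GNN
# (e.g. from PyTorch Geometric) would then learn over this node/edge structure.
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]

# Nodes: one per row, keyed by (table, primary key).
nodes = {("customers", c["customer_id"]): c for c in customers}
nodes.update({("orders", o["order_id"]): o for o in orders})

# Edges: one per foreign-key reference (orders.customer_id -> customers.customer_id).
edges = [(("orders", o["order_id"]), ("customers", o["customer_id"])) for o in orders]

print(len(nodes), "nodes,", len(edges), "edges")
# A GNN layer would aggregate each customer's incoming order messages,
# e.g. to predict churn or lifetime value directly from the raw schema.
```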
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Semantic Data Management in Data Lakes
In recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics.
One way to prevent data lakes from turning into inoperable data swamps is semantic data management.
We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontology-based data access.
arXiv Detail & Related papers (2023-10-23T21:16:50Z)
- TensorBank: Tensor Lakehouse for Foundation Model Training
Streaming and storing high-dimensional data for foundation model training became a critical requirement with the rise of foundation models beyond natural language.
We introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries.
This architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis, and more.
arXiv Detail & Related papers (2023-09-05T10:00:33Z)
- Dataset Quantization
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
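A generic online-learning loop in PyTorch (not PARTIME's API) showing the stream-driven pattern the summary describes: each sample triggers an immediate update instead of waiting for a full batch. The synthetic stream and tiny model are illustrative assumptions.

```python
# Generic online-learning sketch: process each sample as soon as it arrives
# from a stream, taking one gradient step per sample.
import torch
from torch import nn

def sample_stream(n_steps=1000, dim=16):
    """Stand-in for a real-time data source yielding (input, target) pairs."""
    w_true = torch.randn(dim)
    for _ in range(n_steps):
        x = torch.randn(dim)
        y = (x @ w_true).unsqueeze(0)
        yield x, y

model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for x, y in sample_stream():
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # update immediately; no waiting to assemble a batch
    loss.backward()
    opt.step()
```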
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- A Big Data Lake for Multilevel Streaming Analytics
This paper focuses on storing high-volume, high-velocity, and high-variety data in its raw format in a data storage architecture called a data lake.
We discuss and compare different open source and commercial platforms that can be used to develop a data lake.
We present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics.
arXiv Detail & Related papers (2020-09-25T19:57:21Z)
- Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data to the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
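A loose, similarity-based sketch of dataset recommendation with synthetic features; this is a generic illustration of the server-recommends-data-to-client pattern, not the NDS system's actual mechanism.

```python
# Loose illustration only (not NDS's actual method): rank indexed datasets by
# how close their feature centroid is to the client's data centroid. Dataset
# names and all feature vectors are synthetic.
import numpy as np

rng = np.random.default_rng(1)

# Server side: each indexed dataset summarized by the mean of its image features.
server_centroids = {
    "dataset_a": rng.normal(0.0, 1.0, size=512),
    "dataset_b": rng.normal(0.5, 1.0, size=512),
    "dataset_c": rng.normal(-0.5, 1.0, size=512),
}

# Client side: centroid of the (private) target-domain features.
client_centroid = rng.normal(0.4, 1.0, size=512)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranking = sorted(server_centroids.items(),
                 key=lambda kv: cosine(kv[1], client_centroid),
                 reverse=True)
for name, vec in ranking:
    print(name, round(cosine(vec, client_centroid), 3))
# The client would then request samples from the top-ranked dataset(s)
# to pretrain on before fine-tuning on its own data.
```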
arXiv Detail & Related papers (2020-01-09T01:21:30Z)