TensorBank: Tensor Lakehouse for Foundation Model Training
- URL: http://arxiv.org/abs/2309.02094v3
- Date: Thu, 21 Mar 2024 09:03:48 GMT
- Title: TensorBank: Tensor Lakehouse for Foundation Model Training
- Authors: Romeo Kienzler, Leonardo Pondian Tizzei, Benedikt Blumenstiel, Zoltan Arnold Nagy, S. Karthik Mukkavilli, Johannes Schmude, Marcus Freitag, Michael Behrendt, Daniel Salles Civitarese, Naomi Simumba, Daiki Kimura, Hendrik Hamann
- Abstract summary: Streaming and storing high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language.
We introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries.
This architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis, and more.
- Score: 1.8811254972035676
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Storing and streaming high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows tensors to be addressed directly at the block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory that translates relational queries and requested transformations into a dataset instance. Because the HSI stores statistics on block content at different hierarchical resolution levels, irrelevant blocks can be skipped without reading them. This is an opinionated architecture built on open standards and making heavy use of open-source technology. Although hardened for production use on geospatial-temporal data, the architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis, and more.
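The abstract names the moving parts (HSI, HTTP range reads, PyTorch transforms, a dataset factory) but includes no code. The sketch below is a minimal, hypothetical illustration of that data path, not TensorBank's actual API: `HSIIndex`, `RangeReadDataset`, and the per-block statistics schema are all invented for illustration.

```python
import io

import requests
import torch
from torch.utils.data import IterableDataset


class HSIIndex:
    """Hypothetical hierarchical statistical index: per-block summary stats.

    Each entry maps a byte range in an object to statistics about its
    content, so blocks that cannot satisfy a query are skipped unread.
    """

    def __init__(self, blocks):
        # blocks: list of dicts like {"offset": int, "length": int, "min": float, "max": float}
        self.blocks = blocks

    def candidate_blocks(self, lo, hi):
        # Keep only blocks whose value range can intersect [lo, hi].
        return [b for b in self.blocks if b["max"] >= lo and b["min"] <= hi]


class RangeReadDataset(IterableDataset):
    """Streams tensor blocks from object storage via HTTP range reads."""

    def __init__(self, url, index, lo, hi, transform=None):
        self.url, self.index = url, index
        self.lo, self.hi, self.transform = lo, hi, transform

    def __iter__(self):
        for block in self.index.candidate_blocks(self.lo, self.hi):
            # Fetch only the bytes of this block from COS.
            end = block["offset"] + block["length"] - 1
            resp = requests.get(self.url, headers={"Range": f"bytes={block['offset']}-{end}"})
            resp.raise_for_status()
            # Assumes each block is an independently serialized tensor.
            tensor = torch.load(io.BytesIO(resp.content))
            if self.transform is not None:
                tensor = self.transform(tensor)  # PyTorch transform after load
            yield tensor
```

A dataset factory in the abstract's sense would then parse a relational query, resolve it to objects and HSI entries, and hand back such a dataset instance ready for a standard `DataLoader`.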
Related papers
- Automating Data Science Pipelines with Tensor Completion [4.956678070210018]
We model data science pipelines as instances of tensor completion.
The goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values.
We extensively evaluate existing and proposed methods on a number of datasets.
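As a hedged illustration of tensor completion in general (not necessarily the specific method proposed in the paper), the sketch below recovers missing entries of a 3-way tensor by fitting a low-rank CP factorization to the observed entries only.

```python
import torch

def cp_complete(observed, mask, rank=8, steps=2000, lr=0.05):
    """Fill missing entries of a 3-way tensor via low-rank CP factorization.

    observed: (I, J, K) tensor with arbitrary values where mask == 0.
    mask:     (I, J, K) binary tensor, 1 where the entry is known.
    """
    I, J, K = observed.shape
    A = torch.randn(I, rank, requires_grad=True)
    B = torch.randn(J, rank, requires_grad=True)
    C = torch.randn(K, rank, requires_grad=True)
    opt = torch.optim.Adam([A, B, C], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Reconstruct the tensor from the factors: sum_r A[:,r] o B[:,r] o C[:,r]
        recon = torch.einsum("ir,jr,kr->ijk", A, B, C)
        # Fit only the observed entries; the rest are predicted by the model.
        loss = ((recon - observed) * mask).pow(2).sum() / mask.sum()
        loss.backward()
        opt.step()
    return torch.einsum("ir,jr,kr->ijk", A, B, C).detach()
```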
arXiv Detail & Related papers (2024-10-08T22:34:08Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
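DQ's actual algorithm is not reproduced here. As a generic, hedged illustration of compressing a dataset into a small representative subset, the sketch below uses greedy k-center selection over feature embeddings, a common coreset heuristic and explicitly not the paper's DQ method.

```python
import torch

def greedy_k_center(features, budget):
    """Pick `budget` samples so every sample stays close to a selected one.

    features: (N, D) embedding matrix; returns indices of the subset.
    Generic coreset heuristic, not the DQ algorithm itself.
    """
    n = features.shape[0]
    selected = [torch.randint(n, (1,)).item()]           # random seed point
    dist = torch.cdist(features, features[selected]).squeeze(1)
    for _ in range(budget - 1):
        idx = torch.argmax(dist).item()                  # farthest point so far
        selected.append(idx)
        new_dist = torch.cdist(features, features[idx:idx + 1]).squeeze(1)
        dist = torch.minimum(dist, new_dist)             # min distance to the set
    return selected
```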
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - Federated Learning with Heterogeneous Architectures using Graph HyperNetworks [154.60662664160333]
We propose a new FL framework that accommodates heterogeneous client architectures by adopting a graph hypernetwork for parameter sharing.
Unlike existing solutions, our framework does not limit clients to a single shared architecture type, makes no use of external data, and does not require clients to disclose their model architecture.
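As a toy, hedged illustration of hypernetwork-based parameter sharing (a plain hypernetwork conditioned on a client embedding, explicitly not the paper's graph hypernetwork), the sketch below generates a client-specific linear layer from a learned client descriptor, so per-client weights come from shared hypernetwork parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperNet(nn.Module):
    """Generates per-client linear-layer weights from a client embedding.

    The hypernetwork's parameters are shared across clients, while each
    client's actual model weights are produced on the fly.
    """

    def __init__(self, embed_dim, in_features, out_features):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_gen = nn.Linear(embed_dim, in_features * out_features)
        self.bias_gen = nn.Linear(embed_dim, out_features)

    def forward(self, client_embedding, x):
        # Client-specific weights, generated rather than stored per client.
        w = self.weight_gen(client_embedding).view(self.out_features, self.in_features)
        b = self.bias_gen(client_embedding)
        return F.linear(x, w, b)
```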
arXiv Detail & Related papers (2022-01-20T21:36:25Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script-language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - TyXe: Pyro-based Bayesian neural nets for Pytorch [12.343312954353639]
We introduce TyXe, a Bayesian neural network library built on top of PyTorch and Pyro.
Our leading design principle is to cleanly separate architecture, prior, inference and likelihood specification.
In contrast to existing packages, TyXe does not implement any layer classes, and instead relies on architectures defined in generic PyTorch code.
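The snippet below follows the API described in the TyXe paper and README, passing architecture, prior, likelihood, and variational guide separately to a `VariationalBNN`; exact class names and signatures may differ across versions, so treat this as a hedged sketch rather than verified usage.

```python
import torch.nn as nn
import pyro.distributions as dist
import tyxe

# The architecture is plain PyTorch; TyXe adds no layer classes of its own.
net = nn.Sequential(nn.Linear(1, 50), nn.Tanh(), nn.Linear(50, 1))

# Prior, likelihood, and variational posterior are specified separately.
prior = tyxe.priors.IIDPrior(dist.Normal(0.0, 1.0))
likelihood = tyxe.likelihoods.HomoskedasticGaussian(100, scale=0.1)  # 100 = assumed dataset size
guide = tyxe.guides.AutoNormal

bnn = tyxe.VariationalBNN(net, prior, likelihood, guide)
# Training would then look like bnn.fit(data_loader, pyro.optim.Adam({"lr": 1e-3}), 200),
# with data_loader a standard PyTorch DataLoader.
```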
arXiv Detail & Related papers (2021-10-01T09:04:26Z) - HetSeq: Distributed GPU Training on Heterogeneous Infrastructure [13.689451154861203]
HetSeq is a software package that provides the capability to train large neural network models on heterogeneous infrastructure.
Experiments with transformer translation and the BERT language model show that HetSeq scales over heterogeneous systems.
arXiv Detail & Related papers (2020-09-25T19:57:42Z) - Captum: A unified and generic model interpretability library for PyTorch [49.72749684393332]
We introduce a novel, unified, open-source model interpretability library for PyTorch.
The library contains generic implementations of a number of gradient and perturbation-based attribution algorithms.
It can be used for both classification and non-classification models.
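Captum's attribution classes wrap an existing PyTorch model. The snippet below uses the library's `IntegratedGradients` API; the model and input shapes are placeholder assumptions.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Placeholder model; any differentiable PyTorch module works.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 3))
model.eval()

inputs = torch.randn(4, 10, requires_grad=True)
ig = IntegratedGradients(model)
# Attribute the class-0 logit to each input feature (zero baseline by default).
attributions, delta = ig.attribute(inputs, target=0, return_convergence_delta=True)
print(attributions.shape)  # same shape as inputs: (4, 10)
```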
arXiv Detail & Related papers (2020-09-16T18:57:57Z) - KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z) - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
Retrieval-augmented generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation.
We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
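The RAG models are available through Hugging Face Transformers. A minimal usage sketch with the published `facebook/rag-sequence-nq` checkpoint, using the dummy retrieval index so it runs without the full Wikipedia index, could look roughly like this (argument names follow the documented Transformers API, but versions vary):

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

# Load tokenizer, retriever (dummy index for a quick local test), and model.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

inputs = tokenizer("who wrote the origin of species", return_tensors="pt")
# Non-parametric memory: retrieved passages condition the seq2seq generator.
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```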
arXiv Detail & Related papers (2020-05-22T21:34:34Z) - How to 0wn NAS in Your Spare Time [11.997555708723523]
We design an algorithm that reconstructs the key components of a novel deep learning system by exploiting a small amount of information leakage from a cache side-channel attack.
We demonstrate experimentally that we can reconstruct MalConv, a novel data pre-processing pipeline for malware detection, and ProxylessNAS CPU-NAS, a novel network architecture for ImageNet classification.
arXiv Detail & Related papers (2020-02-17T05:40:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.