NNTile: a machine learning framework capable of training extremely large GPT language models on a single node
- URL: http://arxiv.org/abs/2504.13236v1
- Date: Thu, 17 Apr 2025 16:22:32 GMT
- Title: NNTile: a machine learning framework capable of training extremely large GPT language models on a single node
- Authors: Aleksandr Mikhalev, Aleksandr Katrutsa, Konstantin Sozykin, Ivan Oseledets
- Abstract summary: NNTile is based on the StarPU library, which implements task-based parallelism and schedules all provided tasks onto all available processing units. It means that a particular operation, necessary to train a large neural network, can be performed on any of the CPU cores or GPU devices.
- Score: 83.9328245724548
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This study presents the NNTile framework for training large deep neural networks on heterogeneous clusters. NNTile is based on the StarPU library, which implements task-based parallelism and schedules all provided tasks onto all available processing units (CPUs and GPUs). This means that a particular operation required to train a large neural network can be performed on any of the CPU cores or GPU devices, depending on automatic scheduling decisions. Such an approach shifts the burden of deciding where to compute and when to communicate from a human being to an automatic decision maker, whether a simple greedy heuristic or complex AI-based software. The performance of the presented tool for training large language models is demonstrated in extensive numerical experiments.
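To make the abstract's key idea concrete, the sketch below illustrates task-based parallelism with the standard StarPU C API that NNTile builds on. It follows StarPU's canonical vector-scaling pattern: a codelet bundles CPU (and optionally CUDA) implementations of one operation, the data is registered with the runtime, and the task is submitted so the scheduler decides on which processing unit it runs. This is a minimal illustration of StarPU usage under those assumptions, not NNTile's actual code.

```c
#include <starpu.h>

/* CPU implementation of one operation (scale a vector in place).
 * A CUDA implementation could be added to .cuda_funcs so the
 * scheduler may run the same logical task on a GPU instead. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float *v   = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        v[i] *= factor;
}

/* A codelet groups all implementations of the same logical operation. */
static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float x[1024];
    for (int i = 0; i < 1024; i++) x[i] = 1.0f;
    float factor = 3.14f;

    starpu_init(NULL);

    /* Register the data so the runtime can move it between memory nodes. */
    starpu_data_handle_t h;
    starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                (uintptr_t)x, 1024, sizeof(x[0]));

    /* Submit the task; the scheduler picks a worker (CPU core or GPU). */
    starpu_task_insert(&scal_cl,
                       STARPU_RW, h,
                       STARPU_VALUE, &factor, sizeof(factor),
                       0);

    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_shutdown();
    return 0;
}
```

In a training framework, each tensor operation of the forward and backward passes would be submitted this way, and the runtime's scheduling heuristic decides where each task executes and when data transfers happen.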
Related papers
- NEAR: A Training-Free Pre-Estimator of Machine Learning Model Performance [0.0]
We propose a zero-cost proxy, Network Expressivity by Activation Rank (NEAR), to identify the optimal network without training. We demonstrate a state-of-the-art correlation between this network score and the model accuracy on NAS-Bench-101 and NATS-Bench-SSS/TSS.
arXiv Detail & Related papers (2024-08-16T14:38:14Z) - NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals [58.83169560132308]
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks.
arXiv Detail & Related papers (2024-07-18T17:59:01Z) - BOLT: An Automated Deep Learning Framework for Training and Deploying Large-Scale Search and Recommendation Models on Commodity CPU Hardware [28.05159031634185]
BOLT is a sparse deep learning library for training large-scale search and recommendation models on standard CPU hardware.
We evaluate BOLT on a number of information retrieval tasks including product recommendations, text classification, graph neural networks, and personalization.
arXiv Detail & Related papers (2023-03-30T22:03:43Z) - OFA$^2$: A Multi-Objective Perspective for the Once-for-All Neural Architecture Search [79.36688444492405]
Once-for-All (OFA) is a Neural Architecture Search (NAS) framework designed to address the problem of searching for efficient architectures for devices with different resource constraints.
We aim to go one step further in the search for efficiency by explicitly conceiving the search stage as a multi-objective optimization problem.
arXiv Detail & Related papers (2023-03-23T21:30:29Z) - Split-Et-Impera: A Framework for the Design of Distributed Deep Learning Applications [8.434224141580758]
Split-Et-Impera determines the set of the best-split points of a neural network based on deep network interpretability principles.
It performs a communication-aware simulation for the rapid evaluation of different neural network rearrangements.
It suggests the best match between the application's quality-of-service requirements and performance in terms of accuracy and latency.
arXiv Detail & Related papers (2023-03-22T13:00:00Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - Reservoir Stack Machines [77.12475691708838]
Memory-augmented neural networks equip a recurrent neural network with an explicit memory to support tasks that require information storage.
We introduce the reservoir stack machine, a model which can provably recognize all deterministic context-free languages.
Our results show that the reservoir stack machine achieves zero error, even on test sequences longer than the training data.
arXiv Detail & Related papers (2021-05-04T16:50:40Z) - Reservoir Memory Machines as Neural Computers [70.5993855765376]
Differentiable neural computers extend artificial neural networks with an explicit memory without interference.
We achieve some of the computational capabilities of differentiable neural computers with a model that can be trained very efficiently.
arXiv Detail & Related papers (2020-09-14T12:01:30Z) - Exposing Hardware Building Blocks to Machine Learning Frameworks [4.56877715768796]
We focus on how to design topologies that complement a view of neurons as unique functions.
We develop a library that supports training a neural network with custom sparsity and quantization.
arXiv Detail & Related papers (2020-04-10T14:26:00Z) - Neuroevolution of Neural Network Architectures Using CoDeepNEAT and Keras [0.0]
A large portion of the work involved in a machine learning project is to define the best type of algorithm to solve a given problem.
Finding the optimal network topology and configurations for a given problem is a challenge that requires domain knowledge and testing efforts.
arXiv Detail & Related papers (2020-02-11T19:03:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.