JUWELS Booster -- A Supercomputer for Large-Scale AI Research
- URL: http://arxiv.org/abs/2108.11976v1
- Date: Wed, 30 Jun 2021 21:37:02 GMT
- Title: JUWELS Booster -- A Supercomputer for Large-Scale AI Research
- Authors: Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia
Jitsev, Mehdi Cherti, Michael Langguth, Bing Gong, Scarlet Stadtler,
Amirpasha Mozaffari, Gabriele Cavallaro, Rocco Sedona, Alexander Schug,
Alexandre Strube, Roshni Kamath, Martin G. Schultz, Morris Riedel, Thomas
Lippert
- Abstract summary: We present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center.
We detail its system architecture, its support for parallel, distributed model training, and benchmarks indicating its outstanding performance.
- Score: 79.02246047353273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article, we present JUWELS Booster, a recently commissioned
high-performance computing system at the Jülich Supercomputing Center. With
its system architecture, most importantly its large number of powerful Graphics
Processing Units (GPUs) and its fast interconnect via InfiniBand, it is an
ideal machine for large-scale Artificial Intelligence (AI) research and
applications. We detail its system architecture, its support for parallel,
distributed model training, and benchmarks indicating its outstanding
performance. We exemplify its potential for research applications by presenting
large-scale AI research highlights from various scientific fields that require
such a facility.
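As a minimal sketch of the kind of parallel, distributed training such a system supports, the following uses PyTorch's DistributedDataParallel with the NCCL backend, which communicates over InfiniBand where available. The model, data, and hyperparameters are placeholders, not the paper's benchmarked workloads.

```python
# Minimal data-parallel training sketch (PyTorch DDP, NCCL backend).
# Illustrative only: model, data, and hyperparameters are placeholders,
# not the actual workloads benchmarked on JUWELS Booster.
# Launch with, e.g.: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL uses the fast interconnect (e.g. InfiniBand) between nodes.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(1024, 1024).to(device)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(64, 1024, device=device)     # placeholder batch
        y = torch.randn(64, 1024, device=device)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced across GPUs here
        opt.step()
        if rank == 0 and step % 10 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```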
Related papers
- Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research [90.91438597133211]
We introduce WarpSci, a framework designed to overcome crucial system bottlenecks in the application of reinforcement learning.
We eliminate the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations.
arXiv Detail & Related papers (2024-08-01T21:38:09Z)
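The transfer-free design can be illustrated generically (this is not the WarpSci API): keep all environment state in GPU tensors, so thousands of simulations advance with batched kernels and nothing crosses the CPU-GPU boundary per step.

```python
# Generic illustration of GPU-resident batched simulation (not the WarpSci
# API): all environment state lives in GPU tensors, so thousands of
# simulations step in parallel with no per-step CPU<->GPU transfer.
# Requires a CUDA device.
import torch

n_envs = 4096
device = "cuda"
# Toy dynamics: one point mass per environment (hypothetical example).
pos = torch.zeros(n_envs, 2, device=device)
vel = torch.zeros(n_envs, 2, device=device)

def step(action, dt=0.01):
    global pos, vel
    vel = vel + dt * action          # batched update over all envs at once
    pos = pos + dt * vel
    reward = -pos.norm(dim=1)        # rewards stay on the GPU too
    return reward

for _ in range(1000):
    action = torch.randn(n_envs, 2, device=device)  # stand-in for a policy
    reward = step(action)
# Only aggregate statistics ever cross to the CPU:
print(reward.mean().item())
```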
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs) with Transformer architectures, have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z)
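As one concrete instance of such techniques, here is a sketch of weight-only int8 quantization of a linear layer, which shrinks weight memory traffic roughly fourfold at inference time. This is a generic illustration, not code from the cited tutorial.

```python
# Illustrative weight-only int8 quantization of a linear layer (a generic
# inference optimization; not code from the cited tutorial).
import torch

def quantize_int8(w: torch.Tensor):
    # Per-output-channel symmetric quantization: w ~ scale * w_int8.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    w_q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return w_q, scale

def linear_int8(x, w_q, scale, bias=None):
    # Dequantize on the fly; the stored weights are ~4x smaller than fp32.
    y = x @ (w_q.float() * scale).t()
    return y + bias if bias is not None else y

w = torch.randn(4096, 4096)
w_q, scale = quantize_int8(w)
x = torch.randn(1, 4096)
# Quantization error stays small relative to the output scale:
print((x @ w.t() - linear_int8(x, w_q, scale)).abs().max())
```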
- Using the Abstract Computer Architecture Description Language to Model AI Hardware Accelerators [77.89070422157178]
Manufacturers of AI-integrated products face a critical challenge: selecting an accelerator that aligns with their product's performance requirements.
The Abstract Computer Architecture Description Language (ACADL) is a concise formalization of computer architecture block diagrams.
In this paper, we demonstrate how to use the ACADL to model AI hardware accelerators, use their ACADL description to map DNNs onto them, and explain the timing simulation semantics to gather performance results.
arXiv Detail & Related papers (2024-01-30T19:27:16Z)
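ACADL's concrete syntax is not reproduced here; the following is a hypothetical Python analogue of the workflow only: describe an accelerator as parameterized blocks, then estimate a mapped DNN layer's latency with a simple roofline-style bound.

```python
# Hypothetical Python analogue of the ACADL workflow (ACADL's real syntax is
# not reproduced here): describe an accelerator as parameterized blocks, then
# estimate a DNN layer's latency with a simple roofline-style bound.
from dataclasses import dataclass

@dataclass
class Accelerator:            # assumed, simplified block diagram
    macs_per_s: float         # compute array throughput
    dram_bytes_per_s: float   # memory interface bandwidth

@dataclass
class ConvLayer:
    macs: float               # multiply-accumulates in the layer
    bytes_moved: float        # weights + activations traffic

def latency(acc: Accelerator, layer: ConvLayer) -> float:
    # The layer is bound by whichever resource it saturates first.
    return max(layer.macs / acc.macs_per_s,
               layer.bytes_moved / acc.dram_bytes_per_s)

edge_npu = Accelerator(macs_per_s=4e12, dram_bytes_per_s=32e9)  # made-up specs
conv = ConvLayer(macs=231e6, bytes_moved=2.4e6)                 # made-up layer
print(f"estimated latency: {latency(edge_npu, conv) * 1e6:.1f} us")
```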
- DEAP: Design Space Exploration for DNN Accelerator Parallelism [0.0]
Large Language Models (LLMs) are becoming increasingly complex and expensive to train and serve.
This paper showcases how hardware and software co-design can come together and allow us to create customized hardware systems.
arXiv Detail & Related papers (2023-12-24T02:43:01Z)
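A toy version of such design space exploration (a generic sketch, not the DEAP framework): exhaustively score processing-element array shapes under an area budget with a crude latency model.

```python
# Toy design space exploration for accelerator parallelism (a generic sketch,
# not the DEAP framework): exhaustively score (rows, cols) PE-array shapes
# under an area budget using a crude latency model.
from itertools import product

MACS = 8e9            # assumed workload size (multiply-accumulates)
AREA_BUDGET = 4096    # assumed max number of processing elements
CLOCK_HZ = 1e9

def est_latency(rows, cols):
    pes = rows * cols
    utilization = 0.9 if rows == cols else 0.7  # crude shape penalty
    return MACS / (pes * utilization * CLOCK_HZ)

candidates = [(r, c) for r, c in product([8, 16, 32, 64, 128], repeat=2)
              if r * c <= AREA_BUDGET]
best = min(candidates, key=lambda rc: est_latency(*rc))
print("best PE array:", best, f"~{est_latency(*best) * 1e3:.2f} ms")
```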
- Fast GraspNeXt: A Fast Self-Attention Neural Network Architecture for Multi-task Learning in Computer Vision Tasks for Robotic Grasping on the Edge [80.88063189896718]
High architectural and computational complexity can result in poor suitability for deployment on embedded devices.
Fast GraspNeXt is a fast self-attention neural network architecture tailored for embedded multi-task learning in computer vision tasks for robotic grasping.
arXiv Detail & Related papers (2023-04-21T18:07:14Z)
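To make the complexity argument concrete, here is a minimal single-head self-attention block in PyTorch; its O(N^2 * d) cost over N tokens is what embedded-oriented designs must tame. This is not the Fast GraspNeXt architecture itself.

```python
# Minimal single-head self-attention (not the Fast GraspNeXt architecture):
# attention over N tokens costs O(N^2 * d), which is why embedded-friendly
# designs restructure or prune it.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / (x.shape[-1] ** 0.5)
        return self.proj(attn.softmax(dim=-1) @ v)

x = torch.randn(1, 196, 64)   # e.g. a 14x14 feature map flattened to tokens
print(SelfAttention(64)(x).shape)
```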
- Generative Adversarial Super-Resolution at the Edge with Knowledge Distillation [1.3764085113103222]
Single-Image Super-Resolution can support robotic tasks in environments where a reliable visual stream is required.
We propose an efficient Generative Adversarial Network model for real-time Super-Resolution, called EdgeSRGAN.
arXiv Detail & Related papers (2022-09-07T10:58:41Z)
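A sketch of output-level knowledge distillation for super-resolution, the general idea behind training a compact edge model against a larger teacher; the losses and weighting here are illustrative, not EdgeSRGAN's exact recipe.

```python
# Generic output-level knowledge distillation for super-resolution (a sketch
# of the idea, not EdgeSRGAN's exact recipe): the small student matches both
# the ground-truth high-res image and the larger teacher's output.
import torch
import torch.nn.functional as F

def distillation_loss(student_sr, teacher_sr, hr, alpha=0.5):
    # alpha trades supervised reconstruction against imitating the teacher.
    recon = F.l1_loss(student_sr, hr)
    distill = F.l1_loss(student_sr, teacher_sr.detach())
    return (1 - alpha) * recon + alpha * distill

hr = torch.randn(1, 3, 128, 128)          # ground-truth high-res patch
student_sr = torch.randn(1, 3, 128, 128, requires_grad=True)
teacher_sr = torch.randn(1, 3, 128, 128)  # frozen teacher's prediction
loss = distillation_loss(student_sr, teacher_sr, hr)
loss.backward()
```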
- ISyNet: Convolutional Neural Networks design for AI accelerator [0.0]
Current state-of-the-art architectures are found with neural architecture search (NAS), taking model complexity into account.
We propose a measure of the hardware efficiency of a neural architecture search space, the matrix efficiency measure (MEM); a search space comprising hardware-efficient operations; and a latency-aware scaling method.
We show the advantage of the designed architectures on NPU devices for ImageNet and their generalization to downstream classification and detection tasks.
arXiv Detail & Related papers (2021-09-04T20:57:05Z)
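The summary does not define MEM, so here is only a generic latency-aware scaling loop in the same spirit: widen a placeholder network while its measured latency stays within budget. This illustrates the idea, not ISyNet's published method.

```python
# Generic latency-aware scaling loop (an illustration of the idea only; the
# paper's MEM measure and exact scaling rule are not reproduced here): widen
# a placeholder network while measured latency stays within a target budget.
import time
import torch
import torch.nn as nn

def measure_ms(model, x, iters=50):
    with torch.no_grad():
        for _ in range(10):      # warm-up
            model(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - t0) / iters * 1e3

budget_ms = 5.0
x = torch.randn(1, 3, 224, 224)
best = None
for width in (32, 64, 128, 256, 512):
    model = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(width, width, 3, padding=1)).eval()
    if measure_ms(model, x) <= budget_ms:
        best = width             # remember the last width that fits
print("largest width within budget:", best)
```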
- Semantic Scene Segmentation for Robotics Applications [51.66271681532262]
We investigate the behavior of the most successful semantic scene segmentation models in terms of deployment (inference) speed under various setups.
The goal of this work is to provide a comparative study of current state-of-the-art segmentation models, so as to select the one most compliant with robotics application requirements.
arXiv Detail & Related papers (2021-08-25T08:55:20Z)
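A minimal timing harness of the kind such comparative studies rely on; the model and input size are placeholders, and torchvision availability is assumed.

```python
# Minimal inference-speed harness of the kind such comparisons rely on
# (the model and input size are placeholders; assumes torchvision).
import time
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights=None).eval()
x = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    for _ in range(5):                 # warm-up
        model(x)
    t0 = time.perf_counter()
    n = 20
    for _ in range(n):
        model(x)
fps = n / (time.perf_counter() - t0)
print(f"{fps:.1f} frames/s on this machine")
```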
- How to Reach Real-Time AI on Consumer Devices? Solutions for Programmable and Custom Architectures [7.085772863979686]
Deep neural networks (DNNs) have led to large strides in various Artificial Intelligence (AI) inference tasks, such as object and speech recognition.
However, deploying such AI models across commodity devices faces significant challenges.
We present techniques for achieving real-time performance following a cross-stack approach.
arXiv Detail & Related papers (2021-06-21T11:23:12Z)
- The Architectural Implications of Distributed Reinforcement Learning on CPU-GPU Systems [45.479582612113205]
We show how to improve the performance and power efficiency of RL training on CPU-GPU systems.
We quantify the overall hardware utilization on a state-of-the-art distributed RL training framework.
We also introduce a new system design metric, CPU/GPU ratio, and show how to find the optimal balance between CPU and GPU resources.
arXiv Detail & Related papers (2020-12-08T04:50:05Z)
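A back-of-envelope reading of the CPU/GPU ratio metric, with made-up throughput numbers: balance the aggregate CPU-side simulation rate against the GPU learner's consumption rate.

```python
# Back-of-envelope use of a CPU/GPU ratio metric (numbers are made up; this
# illustrates the balancing idea, not the paper's exact model): match
# aggregate CPU simulation throughput to GPU training throughput.
env_steps_per_cpu_core = 2_000      # measured simulation rate (assumed)
steps_consumed_per_gpu = 48_000     # learner's consumption rate (assumed)

cores_per_gpu = steps_consumed_per_gpu / env_steps_per_cpu_core
print(f"balanced CPU/GPU ratio: {cores_per_gpu:.0f} cores per GPU")
# Fewer cores starve the GPU; more cores leave simulation capacity idle.
```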
This list is automatically generated from the titles and abstracts of the papers on this site.