FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with
Quantization-Aware Training and Adaptive Parallelism
- URL: http://arxiv.org/abs/2102.12103v1
- Date: Wed, 24 Feb 2021 07:22:38 GMT
- Title: FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with
Quantization-Aware Training and Adaptive Parallelism
- Authors: Je Yang, Seongmin Hong, Joo-Young Kim
- Abstract summary: FIXAR employs fixed-point data types and arithmetic units for the first time using a SW/HW co-design approach.
Quantization-Aware Training (QAT) reduces its data precision based on the range of activations and performs retraining to minimize the reward degradation.
FIXAR was implemented on Xilinx U50 and achieves 25293.3 inferences per second (IPS) training throughput and 2638.0 IPS/W accelerator efficiency.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present a deep reinforcement learning platform named FIXAR
which employs fixed-point data types and arithmetic units for the first time
using a SW/HW co-design approach. Starting from 32-bit fixed-point data,
Quantization-Aware Training (QAT) reduces its data precision based on the range
of activations and performs retraining to minimize the reward degradation.
FIXAR proposes the adaptive array processing core composed of configurable
processing elements to support both intra-layer parallelism and intra-batch
parallelism for high-throughput inference and training. Finally, FIXAR was
implemented on Xilinx U50 and achieves 25293.3 inferences per second (IPS)
training throughput and 2638.0 IPS/W accelerator efficiency, which is 2.7 times
faster and 15.4 times more energy efficient than those of the CPU-GPU platform
without any accuracy degradation.
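To make the QAT step above concrete, here is a minimal sketch of a range-based fixed-point fake-quantizer with a straight-through estimator. It is an illustration only, not FIXAR's implementation; the bit-allocation rule and function names are assumptions.

```python
# Illustrative sketch only: a generic range-based fixed-point fake-quantizer with a
# straight-through estimator, not FIXAR's actual QAT pipeline.
import math
import torch

def fixed_point_fake_quant(x: torch.Tensor, total_bits: int = 16) -> torch.Tensor:
    """Quantize x to signed fixed point whose integer bits cover the observed range,
    then dequantize; gradients pass straight through for QAT-style retraining."""
    max_abs = float(x.detach().abs().max().clamp_min(1e-8))
    int_bits = max(1, math.ceil(math.log2(max_abs)) + 1)      # +1 for the sign bit
    frac_bits = max(0, total_bits - int_bits)
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = torch.clamp(torch.round(x * scale), qmin, qmax) / scale
    return x + (q - x).detach()                               # straight-through estimator
```

In QAT-style retraining, activations and weights pass through such a fake-quantizer in the forward pass while full-precision master copies are updated, which is how reward degradation from reduced precision is typically recovered.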
Related papers
- SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile [7.544642148576768]
SimpleFSDP is a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework.
It has a simple implementation for maintainability and composability, allows full computation-communication graph tracing, and brings performance enhancements via compiler backend optimizations.
It also features first-of-its-kind intermediate representation (IR) node bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping.
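For orientation, the sketch below shows only the standard PyTorch building blocks such a design composes (the stock FullyShardedDataParallel wrapper plus torch.compile); it is not SimpleFSDP's own API.

```python
# Not SimpleFSDP's implementation: a minimal sketch of composing PyTorch's standard
# FSDP wrapper with torch.compile, the combination a compiler-based FSDP builds on.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_and_compile(model: torch.nn.Module) -> torch.nn.Module:
    # Assumes launch via torchrun so rank/world size are set in the environment.
    dist.init_process_group("nccl")
    model = FSDP(model.cuda())        # shard parameters/gradients across data-parallel ranks
    return torch.compile(model)       # hand the wrapped module to the compiler backend
```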
arXiv Detail & Related papers (2024-11-01T00:43:54Z)
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training [47.07768822212081]
COAT (Compressing Optimizer States and Activations for FP8 Training) is a novel FP8 training framework designed to significantly reduce the memory footprint when training large models.
COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16.
COAT also achieves a 1.43x end-to-end training speedup compared to BF16.
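As an illustration of where such savings come from, the sketch below compresses an optimizer moment block-wise to 8-bit integers plus per-block scales. It is a generic example, not COAT's FP8 format or its dynamic-range handling; the block size and function names are assumptions.

```python
# Generic sketch of block-wise 8-bit compression for optimizer states; not COAT's
# actual FP8 representation. Block size and helper names are illustrative choices.
import torch

def quantize_moment(m: torch.Tensor, block: int = 128):
    """Store an optimizer moment as int8 values plus one fp32 scale per block."""
    flat = m.flatten()
    pad = (-flat.numel()) % block
    flat = torch.nn.functional.pad(flat, (0, pad)).view(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 127.0
    q = torch.round(flat / scale).to(torch.int8)
    return q, scale, m.shape, pad

def dequantize_moment(q, scale, shape, pad):
    flat = (q.float() * scale).flatten()
    return flat[: flat.numel() - pad].view(shape) if pad else flat.view(shape)
```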
arXiv Detail & Related papers (2024-10-25T05:59:30Z)
- Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data [59.6985168241067]
Federated Learning (FL) encounters two important problems, i.e., low training efficiency and limited computational resources.
We propose a new FL framework, FedDUMAP, to leverage the shared insensitive data on the server and the distributed data in edge devices.
Our proposed FL model, FedDUMAP, combines these three original techniques and achieves significantly better performance than baseline approaches.
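A rough sketch of that overall pattern, a FedAvg-style aggregation followed by a momentum-SGD pass over the shared server data, is shown below; it is illustrative only and does not reproduce FedDUMAP's actual update, pruning, or scheduling rules.

```python
# Illustrative only: FedAvg-style aggregation plus a server-side SGD-with-momentum
# pass over the shared (insensitive) server data. Not the actual FedDUMAP update rule.
import torch

def server_round(global_model, client_states, client_sizes, server_loader, loss_fn,
                 lr=0.01, momentum=0.9):
    # 1) Weighted average of client model states (standard FedAvg).
    #    Assumes all state-dict entries are float tensors.
    total = float(sum(client_sizes))
    avg = {k: sum(s[k] * (n / total) for s, n in zip(client_states, client_sizes))
           for k in client_states[0]}
    global_model.load_state_dict(avg)

    # 2) A few momentum-SGD steps on the shared server data.
    opt = torch.optim.SGD(global_model.parameters(), lr=lr, momentum=momentum)
    for x, y in server_loader:
        opt.zero_grad()
        loss_fn(global_model(x), y).backward()
        opt.step()
    return global_model
```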
arXiv Detail & Related papers (2024-08-11T02:59:11Z)
- Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers [15.37318446043671]
This paper introduces a novel reduced precision optimization technique for On-Device Learning (ODL) primitives on MCU-class devices.
Our approach is more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs.
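The general pattern behind such reduced-precision on-device learning (low-precision storage with higher-precision accumulation) can be sketched as follows; this NumPy example is an assumption-laden illustration, not the paper's MCU kernels.

```python
# A minimal NumPy sketch of the general pattern behind reduced-precision on-device
# learning: store weights and activations in 16-bit floats, accumulate in 32-bit.
# Illustration of the idea only, not the paper's optimized MCU primitives.
import numpy as np

def sgd_step_fp16(w16: np.ndarray, x16: np.ndarray, y: np.ndarray, lr: float = 0.01):
    """One linear-regression SGD step with fp16 storage and fp32 accumulation."""
    pred = x16.astype(np.float32) @ w16.astype(np.float32)          # fp32 accumulation
    grad = x16.astype(np.float32).T @ (pred - y) / len(y)           # fp32 gradient
    return (w16.astype(np.float32) - lr * grad).astype(np.float16)  # back to fp16 storage
```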
arXiv Detail & Related papers (2023-05-30T16:14:16Z)
- FedDUAP: Federated Learning with Dynamic Update and Adaptive Pruning Using Shared Data on the Server [64.94942635929284]
Federated Learning (FL) suffers from two critical challenges, i.e., limited computational resources and low training efficiency.
We propose a novel FL framework, FedDUAP, to exploit the insensitive data on the server and the decentralized data in edge devices.
By integrating the two original techniques, our proposed FL model, FedDUAP, significantly outperforms baseline approaches in terms of accuracy (up to 4.8% higher), efficiency (up to 2.8 times faster), and computational cost (up to 61.9% smaller).
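The paper's adaptive pruning criterion is not detailed in this summary; the snippet below is only a generic global magnitude-pruning helper of the kind such a scheme could build on, with the sparsity target chosen arbitrarily.

```python
# Generic magnitude-based pruning sketch; FedDUAP's actual adaptive pruning criterion
# (and where in the training round it is applied) is not reproduced here.
import torch

def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights globally until `sparsity` is reached."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_vals = torch.cat([p.detach().abs().flatten() for p in weights])
    threshold = torch.quantile(all_vals, sparsity)
    with torch.no_grad():
        for p in weights:
            p.mul_((p.abs() > threshold).float())
```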
arXiv Detail & Related papers (2022-04-25T10:00:00Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
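The underlying re-parameterization idea, folding parallel linear branches into a single convolution kernel, can be shown in a few lines; the example below fuses a 3x3 and a 1x1 branch and is a generic illustration, not OREPA's specific training-time block.

```python
# Minimal example of structural re-parameterization: parallel 3x3 and 1x1 conv
# branches (no nonlinearity between them) are folded into one 3x3 kernel.
# Shows the general principle, not OREPA's online squeezing scheme.
import torch
import torch.nn.functional as F

def fuse_branches(w3x3: torch.Tensor, w1x1: torch.Tensor) -> torch.Tensor:
    """Return a single 3x3 kernel equivalent to conv3x3(x) + conv1x1(x)."""
    return w3x3 + F.pad(w1x1, (1, 1, 1, 1))  # embed the 1x1 kernel at the 3x3 center

# Sanity check of equivalence (same padding, no bias).
x = torch.randn(1, 8, 16, 16)
w3, w1 = torch.randn(8, 8, 3, 3), torch.randn(8, 8, 1, 1)
y_two = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1, padding=0)
y_one = F.conv2d(x, fuse_branches(w3, w1), padding=1)
assert torch.allclose(y_two, y_one, atol=1e-4)
```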
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models into MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
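A minimal picture of early-exit inference with a confidence-threshold policy is sketched below; the exit heads, their placement, and the real exit policy are co-optimized in the paper and are only placeholders here.

```python
# Minimal sketch of an early-exit forward pass with a confidence-threshold policy.
# The stages, exit heads, and threshold are placeholders for illustration only.
import torch

def early_exit_forward(stages, exit_heads, x, conf_threshold=0.9):
    """Run backbone stages in order; return the first exit whose mean per-pixel
    confidence clears the threshold, otherwise the final head's prediction."""
    feats = x
    for stage, head in zip(stages, exit_heads):
        feats = stage(feats)
        logits = head(feats)                       # per-pixel class logits
        conf = logits.softmax(dim=1).amax(dim=1)   # per-pixel max class probability
        if conf.mean() >= conf_threshold:
            return logits                          # exit early on "easy" inputs
    return logits                                  # fall through to the last head
```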
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo Matching on FPGA Platform [21.435663827158564]
We propose an energy-efficient architecture for real-time ELAS-based stereo matching on an FPGA platform.
Our FPGA realization achieves up to 38.4x and 3.32x frame rate improvement, and up to 27.1x and 1.13x energy efficiency improvement, respectively.
arXiv Detail & Related papers (2021-04-11T21:22:54Z)
- Hybrid In-memory Computing Architecture for the Training of Deep Neural Networks [5.050213408539571]
We propose a hybrid in-memory computing architecture for the training of deep neural networks (DNNs) on hardware accelerators.
We show that HIC-based training yields an inference model about 50% smaller while achieving accuracy comparable to the baseline.
Our simulations indicate HIC-based training naturally ensures that the number of write-erase cycles seen by the devices is a small fraction of the endurance limit of PCM.
arXiv Detail & Related papers (2021-02-10T05:26:27Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization to gradually increase the precision of activations, weights, and gradients during training.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
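The progressive part can be pictured as a bit-width schedule driving a fake-quantizer, as in the sketch below; the schedule values and quantizer are generic placeholders, not FracTrain's settings.

```python
# Illustrative sketch of progressive fractional quantization: the bit-width used for
# fake-quantization grows over training. Schedule and quantizer are generic
# placeholders, not FracTrain's exact configuration.
import torch

def bits_at(progress: float, schedule=((0.25, 4), (0.5, 6), (0.75, 8), (1.0, 16))) -> int:
    """Map training progress in [0, 1] to a bit-width that increases over time."""
    for frac, bits in schedule:
        if progress <= frac:
            return bits
    return schedule[-1][1]

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake-quantization with a straight-through estimator."""
    scale = x.abs().max().clamp_min(1e-8) / (2 ** (bits - 1) - 1)
    q = torch.round(x / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    return x + (q - x).detach()
```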
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
- Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs [13.83645579871775]
Large-scale convolutional neural networks (CNNs) suffer from very long training times, spanning from hours to weeks.
This work pushes the boundary of quantised training by employing a multilevel approach that utilises multiple precisions.
MuPPET achieves the same accuracy as standard full-precision training with a training-time speedup of up to 1.84$\times$ and an average speedup of 1.58$\times$ across the networks.
arXiv Detail & Related papers (2020-06-16T10:14:36Z)