iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo
Matching on FPGA Platform
- URL: http://arxiv.org/abs/2104.05112v1
- Date: Sun, 11 Apr 2021 21:22:54 GMT
- Title: iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo
Matching on FPGA Platform
- Authors: Tian Gao, Zishen Wan, Yuyang Zhang, Bo Yu, Yanjun Zhang, Shaoshan Liu,
Arijit Raychowdhury
- Abstract summary: We propose an energy-efficient architecture for real-time ELAS-based stereo matching on FPGA platform.
Our FPGA realization achieves up to 38.4x and 3.32x frame rate improvement, and up to 27.1x and 1.13x energy efficiency improvement, respectively.
- Score: 21.435663827158564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stereo matching is a critical task for robot navigation and autonomous
vehicles, providing the depth estimation of surroundings. Among all stereo
matching algorithms, Efficient Large-scale Stereo (ELAS) offers one of the best
tradeoffs between efficiency and accuracy. However, due to the inherent
iterative process and unpredictable memory access pattern, ELAS can only run at
1.5-3 fps on high-end CPUs and difficult to achieve real-time performance on
low-power platforms. In this paper, we propose an energy-efficient architecture
for real-time ELAS-based stereo matching on FPGA platform. Moreover, the
original computational-intensive and irregular triangulation module is reformed
in a regular manner with points interpolation, which is much more
hardware-friendly. Optimizations, including memory management, parallelism, and
pipelining, are further utilized to reduce memory footprint and improve
throughput. Compared with Intel i7 CPU and the state-of-the-art CPU+FPGA
implementation, our FPGA realization achieves up to 38.4x and 3.32x frame rate
improvement, and up to 27.1x and 1.13x energy efficiency improvement,
respectively.
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
architecture achieves a peak throughput of 453.6 Giga Operations per Second.
arXiv Detail & Related papers (2024-08-05T10:18:00Z) - SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs [3.302913401404089]
Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens.
We propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input.
arXiv Detail & Related papers (2024-05-27T10:25:08Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100,3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE)
MoE achieves better accuracy and over 80% reduction computation but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Design optimization for high-performance computing using FPGA [0.0]
We optimize Tensil AI's open-source inference accelerator for maximum performance using ResNet20 trained on CIFAR.
Running the CIFAR test data set shows very little accuracy drop when rounding down from the original 32-bit floating point.
The proposed accelerator achieves a throughput of 21.12 Giga-Operations Per Second (GOP/s) with a 5.21 W on-chip power consumption at 100 MHz.
arXiv Detail & Related papers (2023-04-24T22:20:42Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - Eventor: An Efficient Event-Based Monocular Multi-View Stereo
Accelerator on FPGA Platform [11.962626341154609]
Event cameras are bio-inspired vision sensors that asynchronously represent pixel-level brightness changes as event streams.
EMVS is a technique that exploits the event streams to estimate semi-dense 3D structure with known trajectory.
In this paper, Eventor is proposed as a fast and efficient EMVS accelerator by realizing the most critical and time-consuming stages.
arXiv Detail & Related papers (2022-03-29T11:13:36Z) - ReS2tAC -- UAV-Borne Real-Time SGM Stereo Optimized for Embedded ARM and
CUDA Devices [0.36748639131154304]
FPGAs were the only processing hardware capable of high-performance computing for a long time.
Recent availability of embedded GPU-based systems allows for massively parallel embedded computing on graphics hardware.
We propose an approach for real-time embedded stereo processing on ARM and DJI-enabled devices.
arXiv Detail & Related papers (2021-06-15T07:29:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.