ReS2tAC -- UAV-Borne Real-Time SGM Stereo Optimized for Embedded ARM and
CUDA Devices
- URL: http://arxiv.org/abs/2106.07927v1
- Date: Tue, 15 Jun 2021 07:29:25 GMT
- Title: ReS2tAC -- UAV-Borne Real-Time SGM Stereo Optimized for Embedded ARM and
CUDA Devices
- Authors: Boitumelo Ruf, Jonas Mohrs, Martin Weinmann, Stefan Hinz, J\"urgen
Beyerer
- Abstract summary: FPGAs were the only processing hardware capable of high-performance computing for a long time.
Recent availability of embedded GPU-based systems allows for massively parallel embedded computing on graphics hardware.
We propose an approach for real-time embedded stereo processing on ARM and DJI-enabled devices.
- Score: 0.36748639131154304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the emergence of low-cost robotic systems, such as unmanned aerial
vehicle, the importance of embedded high-performance image processing has
increased. For a long time, FPGAs were the only processing hardware that were
capable of high-performance computing, while at the same time preserving a low
power consumption, essential for embedded systems. However, the recently
increasing availability of embedded GPU-based systems, such as the NVIDIA
Jetson series, comprised of an ARM CPU and a NVIDIA Tegra GPU, allows for
massively parallel embedded computing on graphics hardware. With this in mind,
we propose an approach for real-time embedded stereo processing on ARM and
CUDA-enabled devices, which is based on the popular and widely used Semi-Global
Matching algorithm. In this, we propose an optimization of the algorithm for
embedded CUDA GPUs, by using massively parallel computing, as well as using the
NEON intrinsics to optimize the algorithm for vectorized SIMD processing on
embedded ARM CPUs. We have evaluated our approach with different configurations
on two public stereo benchmark datasets to demonstrate that they can reach an
error rate as low as 3.3%. Furthermore, our experiments show that the fastest
configuration of our approach reaches up to 46 FPS on VGA image resolution.
Finally, in a use-case specific qualitative evaluation, we have evaluated the
power consumption of our approach and deployed it on the DJI Manifold 2-G
attached to a DJI Matrix 210v2 RTK unmanned aerial vehicle (UAV), demonstrating
its suitability for real-time stereo processing onboard a UAV.
Related papers
- Benchmarking Edge AI Platforms for High-Performance ML Inference [0.0]
Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions.
While current approaches often involve scaling down modern hardware, the performance characteristics of neural network workloads can vary significantly.
We compare the latency and throughput of various linear algebra and neural network inference tasks across CPU-only, CPU/GPU, and CPU/NPU integrated solutions.
arXiv Detail & Related papers (2024-09-23T08:27:27Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Benchmarking Edge Computing Devices for Grape Bunches and Trunks
Detection using Accelerated Object Detection Single Shot MultiBox Deep
Learning Models [2.1922186455344796]
This work benchmarks the performance of different platforms for object detection in real-time.
Authors used the RetinaNet ResNet-50 fine-tuned using the natural Vine dataset.
arXiv Detail & Related papers (2022-11-21T17:02:33Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo
Matching on FPGA Platform [21.435663827158564]
We propose an energy-efficient architecture for real-time ELAS-based stereo matching on FPGA platform.
Our FPGA realization achieves up to 38.4x and 3.32x frame rate improvement, and up to 27.1x and 1.13x energy efficiency improvement, respectively.
arXiv Detail & Related papers (2021-04-11T21:22:54Z) - Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z) - Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO [46.20949184826173]
This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms.
Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency.
arXiv Detail & Related papers (2020-03-30T14:16:23Z) - Efficient Video Semantic Segmentation with Labels Propagation and
Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video(EVS) pipeline that combines: (i) On the CPU, a very fast optical flow method, that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.