Related papers: Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO

URL: http://arxiv.org/abs/2003.13493v3
Date: Mon, 3 Aug 2020 09:22:13 GMT
Title: Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO
Authors: Balazs Nagy, Philipp Foehn, Davide Scaramuzza
Abstract summary: This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms. Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency.
Score: 46.20949184826173
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent introduction of powerful embedded graphics processing units (GPUs) has allowed for unforeseen improvements in real-time computer vision applications. It has enabled algorithms to run onboard, well above the standard video rates, yielding not only higher information processing capability, but also reduced latency. This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms in the field of visual-inertial odometry (VIO). While most steps of a VIO pipeline work on visual features, they rely on image data for detection and tracking, both of which are well suited to parallelization. Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency. Our work first revisits the problem of non-maxima suppression for feature detection specifically on GPUs, and proposes a solution that selects local response maxima, imposes spatial feature distribution, and extracts features simultaneously. Our second contribution introduces an enhanced FAST feature detector that applies the aforementioned non-maxima suppression method. Finally, we compare our method to other state-of-the-art CPU and GPU implementations, outperforming all of them in both feature tracking and detection, with over 1000 fps throughput on an embedded Jetson TX2 platform. Additionally, we demonstrate our work integrated in a VIO pipeline achieving metric state estimation at ~200 fps.
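To make the idea of combining non-maxima suppression, spatial distribution, and feature extraction in one pass concrete, here is a minimal CPU sketch of a simplified grid-based variant: the score map is split into cell x cell blocks and each block keeps only its strongest response above a threshold. This is not the authors' GPU implementation (which relies on hardware-specific instructions); the function name `grid_nms` and the parameters `cell` and `threshold` are hypothetical.

```python
from typing import List, Optional, Tuple

# A feature is (x, y, score); this layout is an assumption for illustration.
Feature = Tuple[int, int, float]


def grid_nms(scores: List[List[float]], cell: int, threshold: float) -> List[Feature]:
    """Hypothetical grid NMS: keep at most one feature per cell x cell block.

    Taking the per-block maximum suppresses weaker responses (NMS),
    bounds the number of features per block (spatial distribution), and
    directly emits the surviving features (extraction) in one pass.
    On a GPU, each block would typically be reduced in parallel.
    """
    height = len(scores)
    width = len(scores[0]) if height else 0
    features: List[Feature] = []
    for cy in range(0, height, cell):
        for cx in range(0, width, cell):
            best: Optional[Feature] = None
            # Scan the block, clipping it at the image border.
            for y in range(cy, min(cy + cell, height)):
                for x in range(cx, min(cx + cell, width)):
                    s = scores[y][x]
                    if s > threshold and (best is None or s > best[2]):
                        best = (x, y, s)
            if best is not None:
                features.append(best)
    return features
```

For example, `grid_nms([[0, 1, 0, 0], [0, 0, 0, 5], [3, 0, 0, 0], [0, 0, 2, 0]], cell=2, threshold=0.5)` keeps one feature per 2x2 block: `(1, 0, 1)`, `(3, 1, 5)`, `(0, 2, 3)`, and `(2, 3, 2)`. A design caveat of this simplification: a block maximum lying on a block border may be kept even if a stronger response sits just across the border, so a full NMS would also compare against each candidate's neighborhood.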

Related papers

Faster than Fast: Accelerating Oriented FAST Feature Detection on Low-end Embedded GPUs [11.639825636679454]
This paper presents two methods to accelerate Oriented FAST feature detection on low-end embedded GPUs. Experiments on a Jetson TX2 embedded GPU demonstrate an average speedup of over 7.3 times compared to the widely used OpenCV with GPU support.
arXiv Detail & Related papers (2025-06-08T14:30:30Z)
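Both this entry and the main paper build on the FAST detector. As a reference point, a minimal sketch of the standard FAST segment test follows: a pixel is a corner if at least `n` contiguous pixels on a 16-pixel Bresenham circle of radius 3 are all brighter than the center plus `t`, or all darker than the center minus `t` (`n = 9` is a common choice). This is the textbook test, not the enhanced detector of the main paper or the Oriented FAST variant above; the helper name `is_fast_corner` is hypothetical.

```python
from typing import List, Sequence

# Bresenham circle of radius 3: 16 (dx, dy) offsets, listed in order
# around the circle so that contiguous arcs can be detected.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img: Sequence[Sequence[int]], x: int, y: int,
                   t: int, n: int = 9) -> bool:
    """Standard FAST segment test (hypothetical helper, not the paper's code).

    (x, y) must lie at least 3 pixels away from the image border.
    """
    center = img[y][x]
    # Label each circle pixel: +1 brighter, -1 darker, 0 similar.
    labels: List[int] = []
    for dx, dy in CIRCLE:
        p = img[y + dy][x + dx]
        labels.append(1 if p > center + t else -1 if p < center - t else 0)
    # Scan the ring twice so that arcs wrapping past index 15 are found.
    for sign in (1, -1):
        run = 0
        for label in labels + labels:
            run = run + 1 if label == sign else 0
            if run >= n:
                return True
    return False
```

Scanning the doubled label list is a simple way to handle arcs that wrap around the start of the circle; optimized implementations typically add an early rejection test on a few circle pixels before running the full check.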
Edge-GPU Based Face Tracking for Face Detection and Recognition Acceleration [0.0]
This paper suggests a combined hardware-software approach to optimize face detection and recognition systems on the NVIDIA Jetson AGX Orin. The results suggest that simultaneously using all the hardware engines available in the Orin GPU and integrating a tracker into the pipeline yields a throughput of 290 FPS (frames per second) on 1920 x 1080 input frames containing on average 6 faces per frame. This hardware-software co-design approach can pave the way for high-performance machine vision systems at the edge, which are critically needed for video monitoring in public places.
arXiv Detail & Related papers (2025-05-07T15:57:53Z)
Leveraging Simultaneous Usage of Edge GPU Hardware Engines for Video Face Detection and Recognition [0.0]
This paper aims to maximize the simultaneous usage of the hardware engines available in edge GPUs. It includes the video decoding task, which is required in most face monitoring applications. Results suggest that simultaneously using all the hardware engines available in the recent NVIDIA edge Orin GPU yields higher throughput and a slight power saving of around 300 mW, or about 5%.
arXiv Detail & Related papers (2025-05-07T15:22:17Z)
VR-Pipe: Streamlining Hardware Graphics Pipeline for Volume Rendering [1.9470707535768061]
We implement the state-of-the-art radiance field method, 3D Gaussian splatting, using graphics APIs and evaluate it across synthetic and real-world scenes on today's graphics hardware. We present VR-Pipe, which seamlessly integrates two innovations into graphics hardware to streamline the hardware pipeline for volume rendering. Our evaluation shows that VR-Pipe greatly improves rendering performance, achieving up to a 2.78x speedup over the conventional graphics pipeline with negligible hardware overhead.
arXiv Detail & Related papers (2025-02-24T11:46:36Z)
FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems [5.572152653851948]
FULL-W2V exploits the opportunities for data reuse in the W2V algorithm to reduce access to low memory levels and improve temporal locality. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality.
arXiv Detail & Related papers (2023-12-12T21:22:07Z)
High Performance Computing Applied to Logistic Regression: A CPU and GPU Implementation Comparison [0.0]
We present a versatile GPU-based parallel version of Logistic Regression (LR). Our implementation is a direct translation of the parallel Gradient Descent Logistic Regression algorithm proposed by X. Zou et al. Our method is particularly advantageous for real-time prediction applications such as image recognition, spam detection, and fraud detection.
arXiv Detail & Related papers (2023-08-19T14:49:37Z)
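The entry above parallelizes gradient descent for logistic regression. A minimal sequential sketch of batch gradient descent on the mean cross-entropy loss shows why it parallelizes well: each sample's gradient contribution is computed independently and only the sums are combined. This is a generic textbook version, not the algorithm of X. Zou et al. or the paper's GPU code; the names `train_logreg` and `predict` and the default hyperparameters are assumptions.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function (avoids overflow in math.exp).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X: List[List[float]], y: List[int],
                 lr: float = 0.1, epochs: int = 1000) -> Tuple[List[float], float]:
    """Hypothetical batch gradient descent for binary logistic regression."""
    n_features = len(X[0])
    m = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # Per-sample terms are independent: this loop is the part a GPU
        # version would spread across threads before a parallel reduction.
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step along the averaged gradient.
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> int:
    # Threshold the predicted probability at 0.5.
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0
```

On a toy separable set such as `X = [[-2.0], [-1.0], [1.0], [2.0]]` with labels `[0, 0, 1, 1]`, the learned weight becomes positive and the model predicts class 0 for negative inputs and class 1 for positive ones.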
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Our multi-scale linear attention achieves the global receptive field and multi-scale learning. EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding tasks such as temporal action detection (TAD) often demand huge computing resources. We build a unified framework for efficient end-to-end temporal action detection (ETAD). ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single precision and up to 452x using half precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the summaries found help steer this specific process to cut costs and reduce the production of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
Motion Vector Extrapolation for Video Object Detection [0.0]
MOVEX enables low latency video object detection on common CPU based systems. We show that our approach significantly reduces the baseline latency of any given object detector. Further latency reduction, up to 25x lower than the original latency, can be achieved with minimal accuracy loss.
arXiv Detail & Related papers (2021-04-18T17:26:37Z)
GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON. We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z)
Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures. Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging. We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)
MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle. Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high-definition videos using a hybrid GPU/CPU approach. We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. On the popular Cityscapes dataset with high-resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
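The entry above propagates semantic labels from one frame to the next using optical flow. A minimal sketch of that propagation step, assuming a dense backward flow field that maps each pixel of the current frame to its position in the previous frame, copies the label found at the (rounded) source position. The flow convention, nearest-neighbour lookup, `fill` value, and the name `propagate_labels` are assumptions for illustration; the paper's refinement step is not covered.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # per-pixel (dx, dy), current -> previous


def propagate_labels(prev_labels: List[List[int]], backward_flow: Flow,
                     fill: int = -1) -> List[List[int]]:
    """Hypothetical label propagation by backward warping.

    For each pixel (x, y) of the current frame, look up the label of the
    previous frame at (x + dx, y + dy), rounded to the nearest pixel.
    Pixels whose source falls outside the image get the `fill` label.
    """
    h = len(prev_labels)
    w = len(prev_labels[0]) if h else 0
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = backward_flow[y][x]
            sx = int(round(x + dx))
            sy = int(round(y + dy))
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = prev_labels[sy][sx]
    return out
```

Backward warping is used here because it assigns exactly one source to every output pixel, whereas forward warping can leave holes or write several labels to the same pixel.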

This list is automatically generated from the titles and abstracts of the papers on this site.