Related papers: Fast Implementation of Morphological Filtering Using ARM NEON Extension

URL: http://arxiv.org/abs/2002.09474v1
Date: Wed, 19 Feb 2020 12:55:34 GMT
Title: Fast Implementation of Morphological Filtering Using ARM NEON Extension
Authors: Elena Limonova and Arseny Terekhin and Dmitry Nikolaev and Vladimir Arlazarov
Abstract summary: We consider the speedup potential of morphological image filtering on ARM processors. We propose a fast implementation of erosion and dilation using the ARM SIMD extension NEON. Experiments showed a 3 times efficiency increase for the final implementation of erosion and dilation.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper we consider the speedup potential of morphological image filtering on ARM processors. Morphological operations are widely used in image analysis and recognition, and speeding them up can in some cases significantly reduce the overall execution time of recognition. More specifically, we propose a fast implementation of erosion and dilation using the ARM SIMD extension NEON. These operations are separable when the structuring element is rectangular, so they were implemented as sequential horizontal and vertical passes. Each pass was implemented using the van Herk/Gil-Werman algorithm for large windows and a low-constant linear-complexity algorithm for small windows. The final implementation was improved with SIMD and used a combination of these methods. We also considered a fast transpose implementation for 8x8 and 16x16 matrices using ARM NEON to get additional computational gain for morphological operations. Experiments showed a 3 times efficiency increase for the final implementation of erosion and dilation compared to the van Herk/Gil-Werman algorithm without SIMD, a 5.7 times speedup for the 8x8 matrix transpose, and a 12 times speedup for the 16x16 matrix transpose compared to transpose without SIMD.
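
The separable scheme described in the abstract can be illustrated with a small scalar sketch. This is not the authors' NEON implementation: it is plain Python with no SIMD, and the function names (erode_1d, transpose, erode_rect) and the border handling (only fully covered window positions are returned) are assumptions made for illustration. Each 1-D pass follows the van Herk/Gil-Werman idea of combining per-block prefix and suffix minima, so the cost per sample does not grow with the window width.

```python
def erode_1d(row, w):
    """Running minimum over windows of width w (fully covered positions only).

    van Herk/Gil-Werman scheme: split the row into blocks of w samples,
    compute prefix minima (g) and suffix minima (h) inside each block, then
    combine one value from each array per output sample.
    """
    n = len(row)
    if w < 1 or w > n:
        raise ValueError("window width must satisfy 1 <= w <= len(row)")

    g = list(row)  # g[i] = min of row from the start of i's block up to i
    for i in range(1, n):
        if i % w != 0:
            g[i] = min(g[i - 1], row[i])

    h = list(row)  # h[i] = min of row from i up to the end of i's block
    for i in range(n - 2, -1, -1):
        if i % w != w - 1:
            h[i] = min(h[i + 1], row[i])

    # The window [i, i + w - 1] is covered by the tail of one block (h[i])
    # and the head of the next block (g[i + w - 1]).
    return [min(h[i], g[i + w - 1]) for i in range(n - w + 1)]


def transpose(img):
    """Transpose a list-of-lists image so columns can be processed as rows."""
    return [list(col) for col in zip(*img)]


def erode_rect(img, wx, wy):
    """Erosion with a wx-by-wy rectangle: horizontal pass, then vertical pass."""
    horizontal = [erode_1d(r, wx) for r in img]
    vertical = [erode_1d(c, wy) for c in transpose(horizontal)]
    return transpose(vertical)
```

Dilation follows the same pattern with max in place of min. The vertical pass here is done by transposing and reusing the row routine, which is one plausible reason a fast block transpose helps; how the paper actually combines the transpose with the passes is not specified in the abstract.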

Related papers

Strassen Multisystolic Array Hardware Architectures [0.0]
Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication. General-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups. We present and evaluate new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen's algorithm directly into hardware resource savings.
arXiv Detail & Related papers (2025-02-14T10:40:32Z)
Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs. USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z)
KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process. This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z)
Efficient Additions and Montgomery Reductions of Large Integers for SIMD [2.362288417229025]
This paper presents efficient algorithms for performing Montgomery reductions and additions on integers larger than 512 bits. The new addition algorithm simulates the addition of large integers using a smaller addition, quickly producing the same set of carries. For Montgomery reductions, serial multiplications are replaced with precomputations that can be effectively calculated using SIMD extensions.
arXiv Detail & Related papers (2023-08-31T03:44:49Z)
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers. A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras. Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation. We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by $0.25$ times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
Strong Simulation of Linear Optical Processes [2.3131309703965135]
Given $n$ photons at the input of an $m$-mode interferometer, our algorithm computes the probabilities of all possible output states. It outperforms the permanent-based method by an exponential factor.
arXiv Detail & Related papers (2022-06-21T17:27:17Z)
High-performance symbolic-numerics via multiple dispatch [52.77024349608834]
Symbolics.jl is an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs. We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system. We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers.
arXiv Detail & Related papers (2021-05-09T14:22:43Z)
Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions [2.3513645401551333]
We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab. We show how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
arXiv Detail & Related papers (2020-10-09T16:55:46Z)
PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives. We develop novel data reuse analysis algorithms using the polyhedral model. We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
Parallel 3DPIFCM Algorithm for Noisy Brain MRI Images [3.3946853660795884]
In this paper we implement the algorithm we developed in [1], called 3DPIFCM, in a parallel environment on a GPU. Our results show that the parallel version of the algorithm performs up to 27x faster than the original sequential version and 68x faster than the GAIFCM algorithm.
arXiv Detail & Related papers (2020-02-05T20:30:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.