Fast Implementation of Morphological Filtering Using ARM NEON Extension
- URL: http://arxiv.org/abs/2002.09474v1
- Date: Wed, 19 Feb 2020 12:55:34 GMT
- Title: Fast Implementation of Morphological Filtering Using ARM NEON Extension
- Authors: Elena Limonova, Arseny Terekhin, Dmitry Nikolaev, and Vladimir Arlazarov
- Abstract summary: We consider the speedup potential of morphological image filtering on ARM processors.
We propose a fast implementation of erosion and dilation using the ARM SIMD extension NEON.
Experiments showed a 3 times efficiency increase for the final implementation of erosion and dilation.
- Score: 0.9135092203041721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we consider the speedup potential of morphological image filtering
on ARM processors. Morphological operations are widely used in image analysis
and recognition, and speeding them up can in some cases significantly reduce
the overall execution time of recognition. More specifically, we propose a fast
implementation of erosion and dilation using the ARM SIMD extension NEON. These
operations are separable when the structuring element is rectangular, so they
were implemented as sequential horizontal and vertical passes. Each pass uses
the van Herk/Gil-Werman algorithm for large windows and a low-constant
linear-complexity algorithm for small windows. The final implementation was
improved with SIMD and used a combination of these methods. We also considered
a fast transpose implementation for 8x8 and 16x16 matrices using ARM NEON to
get an additional computational gain for morphological operations. Experiments
showed a 3 times efficiency increase for the final implementation of erosion
and dilation compared to the van Herk/Gil-Werman algorithm without SIMD, a 5.7
times speedup for 8x8 matrix transpose, and a 12 times speedup for 16x16 matrix
transpose compared to transpose without SIMD.
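The separable scheme described in the abstract can be sketched in plain Python. This is only an illustration, not the authors' NEON code: the function names `erode_1d` and `erode_2d`, the odd window width, and the border padding value of 255 (suited to 8-bit images) are all assumptions made here. Each 1D pass uses the van Herk/Gil-Werman idea of per-block prefix and suffix minima, and the vertical pass is done by transposing and reusing the horizontal pass.

```python
def erode_1d(row, w, pad=255):
    """Running minimum over a centered window of odd width w.

    Sketch of the van Herk/Gil-Werman scheme: about three comparisons
    per element regardless of w. The pad value is an assumption.
    """
    n = len(row)
    r = w // 2
    # Pad both borders, then extend to a whole number of blocks of size w.
    padded = [pad] * r + list(row) + [pad] * r
    padded += [pad] * ((-len(padded)) % w)
    m = len(padded)
    g = padded[:]  # g[i]: min from the start of i's block up to i
    h = padded[:]  # h[i]: min from i up to the end of i's block
    for i in range(m):
        if i % w != 0:
            g[i] = min(g[i - 1], padded[i])
    for i in range(m - 1, -1, -1):
        if i % w != w - 1:
            h[i] = min(h[i + 1], padded[i])
    # The window [i, i + w - 1] spans at most two blocks, so it is the
    # union of a block suffix (h) and a block prefix (g).
    return [min(h[i], g[i + w - 1]) for i in range(n)]


def erode_2d(img, wx, wy):
    """Erosion with a wx-by-wy rectangle as two 1D passes."""
    rows = [erode_1d(r, wx) for r in img]                # horizontal pass
    cols = [erode_1d(list(c), wy) for c in zip(*rows)]   # transpose, pass again
    return [list(r) for r in zip(*cols)]                 # transpose back


print(erode_1d([5, 3, 8, 7, 1, 9, 4], 3))  # [3, 3, 3, 1, 1, 1, 4]
print(erode_2d([[5, 3, 8, 7], [9, 4, 6, 2]], 3, 3))  # [[3, 3, 2, 2], [3, 3, 2, 2]]
```

Dilation follows the same pattern with `max` in place of `min` and a padding value of 0. A SIMD version would process many columns or rows per instruction, which is where the transposes mentioned in the abstract come in.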
Related papers
- Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching [53.91395791840179]
We present Unified Spectral Bundling with Sketching (USBS), a provably correct, fast and scalable algorithm for solving massive SDPs.
USBS provides a 500x speed-up over the state-of-the-art scalable SDP solver on an instance with over 2 billion decision variables.
arXiv Detail & Related papers (2023-12-19T02:27:22Z)
- KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition [20.592217626952507]
CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process.
This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints.
arXiv Detail & Related papers (2023-10-06T22:57:25Z)
- Efficient Additions and Montgomery Reductions of Large Integers for SIMD [2.362288417229025]
This paper presents efficient algorithms for performing Montgomery reductions and additions on integers larger than 512 bits.
The new addition algorithm simulates the addition of large integers using a smaller addition, quickly producing the same set of carries.
For Montgomery reductions, serial multiplications are replaced with precomputations that can be effectively calculated using SIMD extensions.
arXiv Detail & Related papers (2023-08-31T03:44:49Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25 times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- Strong Simulation of Linear Optical Processes [2.3131309703965135]
Given $n$ photons at the input of an $m$-mode interferometer, our algorithm computes the probabilities of all possible output states.
It outperforms the permanent-based method by an exponential factor.
arXiv Detail & Related papers (2022-06-21T17:27:17Z)
- High-performance symbolic-numerics via multiple dispatch [52.77024349608834]
Symbolics.jl is an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs.
We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system.
We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers.
arXiv Detail & Related papers (2021-05-09T14:22:43Z)
- Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions [2.3513645401551333]
We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab.
We show how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity.
Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
arXiv Detail & Related papers (2020-10-09T16:55:46Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
- Parallel 3DPIFCM Algorithm for Noisy Brain MRI Images [3.3946853660795884]
In this paper we implement the algorithm we developed in [1], called 3DPIFCM, in a parallel environment on a GPU.
Our results show that the parallel version of the algorithm performs up to 27x faster than the original sequential version and 68x faster than the GAIFCM algorithm.
arXiv Detail & Related papers (2020-02-05T20:30:29Z)