A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching
- URL: http://arxiv.org/abs/2012.09666v1
- Date: Thu, 17 Dec 2020 15:29:41 GMT
- Title: A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching
- Authors: Luka Daoud, Muhammad Kamran Latif, H. S. Jacinto, Nader Rafla
- Abstract summary: We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture sustains the memory bandwidth necessary for a fully pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The scale invariant feature transform (SIFT) algorithm is considered a
classical feature extraction algorithm within the field of computer vision.
SIFT keypoint descriptor matching is a computationally intensive process due to
the amount of data consumed. In this work, we designed a novel fully pipelined
hardware accelerator architecture for SIFT keypoint descriptor matching. The
accelerator core was implemented and tested on a field programmable gate array
(FPGA). The proposed hardware architecture sustains the memory bandwidth
required for a fully pipelined implementation and reaches the roofline of the
performance model, i.e., the maximum attainable throughput. The
fully pipelined matching architecture was designed based on the cosine angle
distance method. Our architecture was optimized for 16-bit fixed-point
operations and implemented on hardware using a Xilinx Zynq-based FPGA
development board. Our proposed architecture shows a noticeable reduction of
area resources compared with its counterparts in literature, while maintaining
high throughput by alleviating memory bandwidth restrictions. The results show
a reduction in consumed device resources of up to 91 percent in LUTs and 79
percent in BRAMs. Our hardware implementation is 15.7 times faster than the
comparable software approach.
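For intuition about the matching step, here is a minimal software sketch of cosine-angle-distance matching of unit-normalized 128-D SIFT descriptors. Under the roofline model, attainable throughput is min(peak compute, memory bandwidth x arithmetic intensity), which is the bound the paper's pipeline is said to sustain. The Q1.15 quantization (standing in for the paper's 16-bit fixed-point datapath), the 0.35 rad threshold, and all names below are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: SIFT descriptor matching by cosine angle distance.
# Assumptions (not from the paper): unit-normalized descriptors, Q1.15
# fixed-point quantization, a 0.35 rad acceptance threshold.
import numpy as np

FRAC_BITS = 15                 # assumed Q1.15 fixed-point format
SCALE = 1 << FRAC_BITS

def to_fixed(x: np.ndarray) -> np.ndarray:
    """Quantize float values in [-1, 1] to int16."""
    return np.clip(np.round(x * SCALE), -SCALE, SCALE - 1).astype(np.int16)

def match(query: np.ndarray, database: np.ndarray, max_angle_rad: float = 0.35):
    """For each query descriptor, return (index, angle) of its best
    database match, with index -1 when the angle exceeds the threshold."""
    q = to_fixed(query)        # (Nq, 128) int16
    d = to_fixed(database)     # (Nd, 128) int16
    # Accumulate dot products in int64, as a wide accumulator would in
    # hardware. For unit-length descriptors the dot product is already
    # the cosine of the angle between them.
    dots = q.astype(np.int64) @ d.astype(np.int64).T
    cos = np.clip(dots / float(SCALE * SCALE), -1.0, 1.0)
    angles = np.arccos(cos)    # cosine angle distance, in radians
    best = angles.argmin(axis=1)
    best_angle = angles[np.arange(len(q)), best]
    return np.where(best_angle > max_angle_rad, -1, best), best_angle
```

Because arccos is monotonically decreasing, a hardware matcher can threshold the raw dot product instead and skip the arccos unit entirely; the explicit angle is kept here only for readability.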
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
- FPGA-QHAR: Throughput-Optimized for Quantized Human Action Recognition on The Edge [0.6254873489691849]
This paper proposes an integrated, end-to-end, scalable HW/SW accelerator co-design for human action recognition (HAR), based on an enhanced 8-bit quantized Two-Stream SimpleNet-PyTorch CNN architecture.
Our development uses a partially streaming dataflow architecture to achieve higher throughput in the trade-off between network design and resource utilization.
Our proposed methodology achieves nearly 81% prediction accuracy with approximately 24 FPS real-time inference throughput at 187 MHz on the ZCU104.
arXiv Detail & Related papers (2023-11-04T10:38:21Z)
- INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines, respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design incurs very small accuracy loss and achieves 80.2x and 2.6x speedups compared to CPU and GPU implementations, respectively.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
- Scaling Quantum Approximate Optimization on Near-term Hardware [49.94954584453379]
We quantify scaling of the expected resource requirements by optimized circuits for hardware architectures with varying levels of connectivity.
We show that the number of measurements, and hence the total time to synthesize a solution, grows exponentially in problem size and problem graph degree.
These problems may be alleviated by increasing hardware connectivity or by recently proposed modifications to the QAOA that achieve higher performance with fewer circuit layers.
arXiv Detail & Related papers (2022-01-06T21:02:30Z)
- FTRANS: Energy-Efficient Acceleration of Transformers using FPGA [11.032972017827248]
We propose an efficient acceleration framework, Ftrans, for transformer-based large scale language representations.
Our framework significantly reduces the model size of NLP models by up to 16 times.
Our FPGA design achieves 27.07x and 81x improvement in performance and energy efficiency compared to CPU, and up to 8.80x improvement in energy efficiency compared to GPU.
arXiv Detail & Related papers (2020-07-16T18:58:31Z)
- SPEC2: SPECtral SParsE CNN Accelerator on FPGAs [31.31419913907224]
We propose SPEC2 -- the first work to prune and accelerate spectral CNNs.
We design an optimized pipeline architecture on FPGA that has efficient random access into sparse kernels.
The resulting accelerators achieve up to 24x higher throughput compared with state-of-the-art FPGA implementations of VGG16.
arXiv Detail & Related papers (2019-10-16T23:30:22Z)
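As background on the term, the sketch below shows the core idea behind the spectral CNNs that SPEC2 prunes and accelerates: convolution becomes element-wise multiplication of FFT coefficients, so pruning zeroes entries of the transformed kernel, yielding the sparse spectral kernels the accelerator must index into efficiently. This is a generic single-channel illustration with circular boundaries, not SPEC2's actual pipeline.

```python
# Generic spectral (FFT-domain) convolution; an illustration of the idea
# behind spectral CNNs, not SPEC2's implementation.
import numpy as np

def spectral_conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular 2-D convolution of `image` with `kernel` via the FFT."""
    h, w = image.shape
    X = np.fft.rfft2(image)                # image spectrum
    K = np.fft.rfft2(kernel, s=(h, w))     # kernel zero-padded, then transformed
    # Pruning a spectral CNN zeroes entries of K; the surviving sparse
    # entries are what an accelerator like SPEC2 accesses at random.
    return np.fft.irfft2(X * K, s=(h, w))
```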
This list is automatically generated from the titles and abstracts of the papers in this site.