Fast 2D Convolutions and Cross-Correlations Using Scalable Architectures
- URL: http://arxiv.org/abs/2112.13150v1
- Date: Fri, 24 Dec 2021 22:34:51 GMT
- Title: Fast 2D Convolutions and Cross-Correlations Using Scalable Architectures
- Authors: Cesar Carranza, Daniel Llamocca, and Marios Pattichis
- Abstract summary: The basic idea is to map 2D convolutions and cross-correlations to a collection of 1D convolutions and cross-correlations in the transform domain.
The approach uses scalable architectures that can be fitted into modern FPGA and Zynq-SOC devices.
- Score: 2.2940141855172027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The manuscript describes fast and scalable architectures and associated
algorithms for computing convolutions and cross-correlations. The basic idea is
to map 2D convolutions and cross-correlations to a collection of 1D
convolutions and cross-correlations in the transform domain. This is
accomplished through the use of the Discrete Periodic Radon Transform (DPRT)
for general kernels and the use of SVD-LU decompositions for low-rank kernels.
The approach uses scalable architectures that can be fitted into modern FPGA
and Zynq-SOC devices. Depending on the types and amounts of available resources, 2D
convolutions and cross-correlations of $P\times P$ blocks can be computed in anywhere
from $O(P)$ to $O(P^2)$ clock cycles. Thus, there is a trade-off between performance
and the number and types of required resources. We
provide implementations of the proposed architectures using modern programmable
devices (Virtex-7 and Zynq-SOC). Based on the amounts and types of required
resources, we show that the proposed approaches significantly outperform
current methods.
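The low-rank path described in the abstract (SVD decomposition of the kernel so that a 2D convolution becomes a few pairs of 1D convolutions) can be illustrated with a short numpy sketch. This is an illustration of the separable-filtering idea only, not the paper's FPGA architecture; the function name and the rank-truncation interface are ours, and `np.convolve` applies a true (flipped) convolution:

```python
import numpy as np

def lowrank_conv2d(image, kernel, rank=1):
    """Approximate 2D convolution via SVD of the kernel.

    A rank-r kernel factors as K ~ sum_k s_k u_k v_k^T, so conv2d(I, K)
    reduces to r pairs of 1D convolutions (rows, then columns).
    """
    U, s, Vt = np.linalg.svd(kernel)
    out = np.zeros_like(image, dtype=float)
    for k in range(rank):
        col = U[:, k] * s[k]   # 1D column filter, scaled by singular value
        row = Vt[k, :]         # 1D row filter
        # Separable pass: 1D convolution along rows, then along columns.
        tmp = np.apply_along_axis(
            lambda r: np.convolve(r, row, mode='same'), 1, image)
        out += np.apply_along_axis(
            lambda c: np.convolve(c, col, mode='same'), 0, tmp)
    return out
```

For an exactly rank-1 kernel (e.g. a box blur, the outer product of two 1D box filters), `rank=1` reproduces the full 2D convolution with two 1D passes instead of one 2D pass.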
Related papers
- Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond [7.280765035096294]
We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2 L)$ time.
We provide a proof-of-concept implementation for Hyena, which gets up to $1.6\times$ end-to-end improvement over standard inference.
arXiv Detail & Related papers (2024-10-16T19:23:46Z) - Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity [11.71206628091551]
Diffusion models are costly to train and evaluate; reducing their inference cost remains a major goal.
Inspired by the recent empirical success in accelerating diffusion models via the parallel sampling technique [shih2024parallel], we propose to divide the sampling process into $\mathcal{O}(1)$ blocks with parallelizable Picard iterations within each block.
Our results shed light on the potential of fast and efficient sampling of high-dimensional data on fast-evolving modern large-memory GPU clusters.
arXiv Detail & Related papers (2024-05-24T23:59:41Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - VEXIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity [36.341893383865745]
VexIR2Vec is an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR).
We learn the vocabulary of representations at the entity level of the IR using the knowledge graph embedding techniques in an unsupervised manner.
VexIR2Vec is $3.1$-$3.5\times$ faster than the closest baselines and orders of magnitude faster than other tools.
arXiv Detail & Related papers (2023-12-01T11:22:10Z) - CORE: Common Random Reconstruction for Distributed Optimization with
Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose CORE (Common Random Reconstruction), which can be used to compress information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Compiling Quantum Circuits for Dynamically Field-Programmable Neutral Atoms Array Processors [5.012570785656963]
Dynamically field-programmable qubit arrays (DPQA) have emerged as a promising platform for quantum information processing.
In this paper, we consider a DPQA architecture that contains multiple arrays and supports 2D array movements.
We show that our DPQA-based compiled circuits feature reduced scaling overhead compared to a grid fixed architecture.
arXiv Detail & Related papers (2023-06-06T08:13:10Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - Fast and Scalable Computation of the Forward and Inverse Discrete
Periodic Radon Transform [2.2940141855172027]
The Discrete Periodic Radon Transform (DPRT) has been extensively used in applications that involve image reconstructions from projections.
This manuscript introduces a fast and scalable approach for computing the forward and inverse DPRT.
arXiv Detail & Related papers (2021-12-24T22:33:13Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - High-performance symbolic-numerics via multiple dispatch [52.77024349608834]
Symbolics.jl is an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs.
We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system.
We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers.
arXiv Detail & Related papers (2021-05-09T14:22:43Z)
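As background to the DPRT-based mapping described in the main abstract (and the companion DPRT paper above), the forward transform can be sketched as a direct reference implementation. This is a plain $O(N^3)$ sketch for prime $N$, not the fast and scalable hardware algorithm the papers propose, and indexing conventions vary between papers:

```python
import numpy as np

def dprt_forward(f):
    """Forward Discrete Periodic Radon Transform of an N x N image, N prime.

    Returns an (N+1) x N array of projections: row m < N sums the image
    along the discrete line i = (d + m*j) mod N; the last row holds the
    column sums. Each projection is a partition of the image pixels, so
    every row of the output sums to the total image sum.
    """
    N = f.shape[0]
    assert f.shape == (N, N)
    R = np.zeros((N + 1, N), dtype=f.dtype)
    j = np.arange(N)
    for m in range(N):
        for d in range(N):
            R[m, d] = f[(d + m * j) % N, j].sum()
    R[N, :] = f.sum(axis=0)  # the remaining direction: column sums
    return R
```

Each of the $N+1$ projections is a 1D signal of length $N$; the transform-domain trick in the main paper is that 2D convolution of two images corresponds to 1D circular convolutions of their matching DPRT projections.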
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.