Fast 2D Convolutions and Cross-Correlations Using Scalable Architectures
- URL: http://arxiv.org/abs/2112.13150v1
- Date: Fri, 24 Dec 2021 22:34:51 GMT
- Title: Fast 2D Convolutions and Cross-Correlations Using Scalable Architectures
- Authors: Cesar Carranza, Daniel Llamocca, and Marios Pattichis
- Abstract summary: The basic idea is to map 2D convolutions and cross-correlations to a collection of 1D convolutions and cross-correlations in the transform domain.
The approach uses scalable architectures that can be fitted into modern FPGA and Zynq-SOC devices.
- Score: 2.2940141855172027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The manuscript describes fast and scalable architectures and associated
algorithms for computing convolutions and cross-correlations. The basic idea is
to map 2D convolutions and cross-correlations to a collection of 1D
convolutions and cross-correlations in the transform domain. This is
accomplished through the use of the Discrete Periodic Radon Transform (DPRT)
for general kernels and the use of SVD-LU decompositions for low-rank kernels.
The approach uses scalable architectures that can be fitted into modern FPGA
and Zynq-SOC devices. Depending on the types and amounts of available resources, 2D
convolutions and cross-correlations of $P\times P$ blocks can be computed in anywhere
from $O(P)$ to $O(P^2)$ clock cycles. Thus, there is a trade-off between performance
and the number and types of required resources. We
provide implementations of the proposed architectures using modern programmable
devices (Virtex-7 and Zynq-SOC). Based on the amounts and types of required
resources, we show that the proposed approaches significantly outperform
current methods.
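The low-rank path described in the abstract (SVD decomposition of the kernel so that a 2D convolution becomes a few pairs of 1D convolutions) can be illustrated with a short numpy sketch. This is an illustration of the separable-filtering idea only, not the paper's FPGA architecture; the function name and the rank-truncation interface are ours, and `np.convolve` applies a true (flipped) convolution:

```python
import numpy as np

def lowrank_conv2d(image, kernel, rank=1):
    """Approximate 2D convolution via SVD of the kernel.

    A rank-r kernel factors as K ~ sum_k s_k u_k v_k^T, so conv2d(I, K)
    reduces to r pairs of 1D convolutions (rows, then columns).
    """
    U, s, Vt = np.linalg.svd(kernel)
    out = np.zeros_like(image, dtype=float)
    for k in range(rank):
        col = U[:, k] * s[k]   # 1D column filter, scaled by singular value
        row = Vt[k, :]         # 1D row filter
        # Separable pass: 1D convolution along rows, then along columns.
        tmp = np.apply_along_axis(
            lambda r: np.convolve(r, row, mode='same'), 1, image)
        out += np.apply_along_axis(
            lambda c: np.convolve(c, col, mode='same'), 0, tmp)
    return out
```

For an exactly rank-1 kernel (e.g. a box blur, the outer product of two 1D box filters), `rank=1` reproduces the full 2D convolution with two 1D passes instead of one 2D pass.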
Related papers
- Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond [7.280765035096294]
We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2 L)$ time.
We provide a proof-of-concept implementation for Hyena, which gets up to $1.6\times$ end-to-end improvement over standard inference.
arXiv Detail & Related papers (2024-10-16T19:23:46Z) - Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity [11.71206628091551]
Diffusion models are costly to train and evaluate; reducing their inference cost remains a major goal.
Inspired by the recent empirical success in accelerating diffusion models via the parallel sampling technique [shih2024parallel], we propose to divide the sampling process into $\mathcal{O}(1)$ blocks with parallelizable Picard iterations within each block.
Our results shed light on the potential of fast and efficient sampling of high-dimensional data on fast-evolving modern large-memory GPU clusters.
arXiv Detail & Related papers (2024-05-24T23:59:41Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - VEXIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity [36.341893383865745]
VexIR2Vec is an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR).
We learn the vocabulary of representations at the entity level of the IR using the knowledge graph embedding techniques in an unsupervised manner.
VexIR2Vec is $3.1$-$3.5\times$ faster than the closest baselines and orders of magnitude faster than other tools.
arXiv Detail & Related papers (2023-12-01T11:22:10Z) - CORE: Common Random Reconstruction for Distributed Optimization with
Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose CORE (Common Random Reconstruction), which can be used to compress information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order
Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - Compiling Quantum Circuits for Dynamically Field-Programmable Neutral Atoms Array Processors [5.012570785656963]
Dynamically field-programmable qubit arrays (DPQA) have emerged as a promising platform for quantum information processing.
In this paper, we consider a DPQA architecture that contains multiple arrays and supports 2D array movements.
We show that our DPQA-based compiled circuits feature reduced scaling overhead compared to a grid fixed architecture.
arXiv Detail & Related papers (2023-06-06T08:13:10Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - Fast and Scalable Computation of the Forward and Inverse Discrete
Periodic Radon Transform [2.2940141855172027]
The Discrete Periodic Radon Transform (DPRT) has been extensively used in applications that involve image reconstructions from projections.
This manuscript introduces a fast and scalable approach for computing the forward and inverse DPRT.
arXiv Detail & Related papers (2021-12-24T22:33:13Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - High-performance symbolic-numerics via multiple dispatch [52.77024349608834]
Symbolics.jl is an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs.
We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system.
We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers.
arXiv Detail & Related papers (2021-05-09T14:22:43Z)
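As background to the DPRT-based mapping described in the main abstract (and the companion DPRT paper above), the forward transform can be sketched as a direct reference implementation. This is a plain $O(N^3)$ sketch for prime $N$, not the fast and scalable hardware algorithm the papers propose, and indexing conventions vary between papers:

```python
import numpy as np

def dprt_forward(f):
    """Forward Discrete Periodic Radon Transform of an N x N image, N prime.

    Returns an (N+1) x N array of projections: row m < N sums the image
    along the discrete line i = (d + m*j) mod N; the last row holds the
    column sums. Each projection is a partition of the image pixels, so
    every row of the output sums to the total image sum.
    """
    N = f.shape[0]
    assert f.shape == (N, N)
    R = np.zeros((N + 1, N), dtype=f.dtype)
    j = np.arange(N)
    for m in range(N):
        for d in range(N):
            R[m, d] = f[(d + m * j) % N, j].sum()
    R[N, :] = f.sum(axis=0)  # the remaining direction: column sums
    return R
```

Each of the $N+1$ projections is a 1D signal of length $N$; the transform-domain trick in the main paper is that 2D convolution of two images corresponds to 1D circular convolutions of their matching DPRT projections.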
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.