Related papers: Efficient FPGA Implementation of Time-Domain Popcount for Low-Complexity Machine Learning

URL: http://arxiv.org/abs/2505.02181v1
Date: Sun, 04 May 2025 16:44:15 GMT
Title: Efficient FPGA Implementation of Time-Domain Popcount for Low-Complexity Machine Learning
Authors: Shengyu Duan, Marcos L. L. Sartori, Rishad Shafik, Alex Yakovlev, Emre Ozer
Abstract summary: Population count (popcount) is a crucial operation for many low-complexity machine learning (ML) algorithms. We propose an innovative approach to accelerate and optimize these operations by performing them in the time domain.
Score: 0.2663045001864042
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Population count (popcount) is a crucial operation for many low-complexity machine learning (ML) algorithms, including the Tsetlin Machine (TM), a promising new ML method that is particularly well suited to classification tasks. The inference mechanism in a TM consists of propositional logic-based structures within each class, followed by a majority voting scheme that makes the classification decision. In a TM, the voters are the outputs of Boolean clauses. The voting mechanism comprises two operations: a popcount for each class, and an argmax operation that determines the class with the most votes. While TMs offer a lightweight ML alternative, their performance is often limited by the high computational cost of the popcount and of the comparisons required to produce the argmax result. In this paper, we propose an innovative approach to accelerate and optimize these operations by performing them in the time domain. Our time-domain implementation uses programmable delay lines (PDLs) and arbiters to perform both operations through delay-based mechanisms. We also present an FPGA design flow for practical implementation of the time-domain popcount, addressing delay skew and ensuring that the hardware behavior matches the model's intended functionality. By leveraging the natural compatibility of the proposed popcount with asynchronous architectures, we demonstrate significant improvements in an asynchronous TM, including up to 38% reduction in latency, 43.1% reduction in dynamic power, and 15% savings in resource utilization, compared to synchronous TMs using adder-based popcount.
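
The voting step described in the abstract can be illustrated with a small software model. The sketch below is not the authors' circuit: it assumes an idealized delay line in which each clause output of 1 selects a fast path and each 0 a slow path, and an ideal arbiter that picks the earliest arrival. The function names and the delay values (fast=1.0, slow=2.0) are hypothetical, chosen only to show why the earliest edge corresponds to the largest popcount when every class has the same number of clauses.

```python
def popcount_argmax(clause_outputs):
    """Adder-style reference: count the 1-votes per class, then take the argmax."""
    votes = [sum(bits) for bits in clause_outputs]
    winner = max(range(len(votes)), key=votes.__getitem__)
    return winner, votes


def time_domain_argmax(clause_outputs, fast=1.0, slow=2.0):
    """Toy delay-line race (assumed model, not the paper's hardware).

    Each clause output adds `fast` to the class's total delay if it is 1 and
    `slow` if it is 0. With equal-length lines, more 1s means a shorter delay,
    so the earliest arrival marks the class with the most votes.
    """
    if fast >= slow:
        raise ValueError("the fast path must be shorter than the slow path")
    if len({len(bits) for bits in clause_outputs}) != 1:
        raise ValueError("all classes need the same number of clauses")
    arrival = [sum(fast if b else slow for b in bits) for bits in clause_outputs]
    # Ideal arbiter: the first edge to arrive wins (lowest index on a tie).
    winner = min(range(len(arrival)), key=arrival.__getitem__)
    return winner, arrival


if __name__ == "__main__":
    # Three classes, four Boolean clause outputs each (made-up values).
    example = [[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 1, 0]]
    print(popcount_argmax(example))
    print(time_domain_argmax(example))
```

Both functions select the same class for the example input. In this model the comparison is performed by the race itself, so no explicit count is ever computed; a real implementation would also have to deal with delay skew between lines, which the abstract says the FPGA design flow addresses.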

Related papers

Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for discrete diffusion language models (dLLMs).
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
Event-Driven Digital-Time-Domain Inference Architectures for Tsetlin Machines [6.161316627062721]
Machine learning fits model parameters to approximate input-output mappings, predicting unknown samples. These models often require extensive arithmetic computations during inference, increasing latency and power consumption. This paper proposes a digital-time-domain computing approach for the Tsetlin machine (TM) inference process to address these challenges.
arXiv Detail & Related papers (2025-11-12T18:24:46Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid bulk synchronous parallel (BSP) model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization [0.4499833362998488]
The Tsetlin Machine (TM) offers high-speed inference on resource-constrained devices such as CPUs. We propose an efficient software implementation of the TM by leveraging instruction-level bitwise operations. We introduce an early exit mechanism, which exploits the TM's AND-based clause evaluation to avoid unnecessary computations.
arXiv Detail & Related papers (2025-10-17T13:44:20Z)
Modality Agnostic Efficient Long Range Encoder [14.705955027331674]
We address the challenge of long-context processing on a single device using generic implementations. To overcome these limitations, we propose MAELRE, a unified and efficient transformer architecture. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models.
arXiv Detail & Related papers (2025-07-25T16:19:47Z)
LOP: Learning Optimal Pruning for Efficient On-Demand MLLMs Scaling [52.1366057696919]
LOP is an efficient neural pruning framework that learns optimal pruning strategies from the target pruning constraint. The LOP approach trains autoregressive neural networks (NNs) to directly predict layer-wise pruning strategies adaptive to the target pruning constraint. Experimental results show that LOP outperforms state-of-the-art pruning methods in various metrics while achieving up to three orders of magnitude speedup.
arXiv Detail & Related papers (2025-06-15T12:14:16Z)
CCLSTM: Coupled Convolutional Long-Short Term Memory Network for Occupancy Flow Forecasting [0.0]
We propose Coupled Convolutional LSTM (CTM), a lightweight, end-to-end trainable architecture based solely on convolutional operations. CTM achieves state-of-the-art performance on occupancy flow metrics and, as of this submission, ranks 1st in all metrics on the 2024 Occupancy and Flow Prediction Challenge leaderboard.
arXiv Detail & Related papers (2025-06-06T14:38:55Z)
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
Learning Symbolic Persistent Macro-Actions for POMDP Solving Over Time [52.03682298194168]
This paper proposes an integration of temporal logical reasoning and Partially Observable Markov Decision Processes (POMDPs). Our method leverages a fragment of Linear Temporal Logic (LTL) based on Event Calculus (EC) to generate persistent (i.e., constant) macro-actions. These macro-actions guide Monte Carlo Tree Search (MCTS)-based POMDP solvers over a time horizon.
arXiv Detail & Related papers (2025-05-06T16:08:55Z)
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
Online Scheduling for LLM Inference with KV Cache Constraints [22.155429544207827]
Large Language Model (LLM) inference is an intensive process requiring efficient scheduling to optimize latency and resource utilization. We propose novel scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory. Our results offer a path toward more sustainable and cost-effective LLM deployment.
arXiv Detail & Related papers (2025-02-10T23:11:44Z)
Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs [0.2294388534633318]
eFPGAs allow for the design of hardware accelerators of edge Machine Learning (ML) applications at a lower power budget. The limited eFPGA logic and memory significantly constrain compute capabilities and model size. The proposed eFPGA accelerator focuses on minimizing resource usage and allowing flexibility for on-field recalibration over throughput.
arXiv Detail & Related papers (2025-02-10T12:49:22Z)
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR). CFSR inherits the advantages of both convolution-based and transformer-based approaches. Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features. Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
Revisiting State Augmentation methods for Reinforcement Learning with Stochastic Delays [10.484851004093919]
This paper formally describes the notion of Markov Decision Processes (MDPs) with delays. We show that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with significantly simplified cost structure. We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built upon this framework achieves near-optimal rewards in environments with delays in actions and observations.
arXiv Detail & Related papers (2021-08-17T10:45:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.