Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization
- URL: http://arxiv.org/abs/2510.15653v1
- Date: Fri, 17 Oct 2025 13:44:20 GMT
- Title: Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization
- Authors: Yefan Zeng, Shengyu Duan, Rishad Shafik, Alex Yakovlev
- Abstract summary: The Tsetlin Machine (TM) offers high-speed inference on resource-constrained devices such as CPUs. We propose an efficient software implementation of the TM by leveraging instruction-level bitwise operations. We introduce an early exit mechanism, which exploits the TM's AND-based clause evaluation to avoid unnecessary computations.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Tsetlin Machine (TM) offers high-speed inference on resource-constrained devices such as CPUs. Its logic-driven operations naturally lend themselves to parallel execution on modern CPU architectures. Motivated by this, we propose an efficient software implementation of the TM by leveraging instruction-level bitwise operations for compact model representation and accelerated processing. To further improve inference speed, we introduce an early exit mechanism, which exploits the TM's AND-based clause evaluation to avoid unnecessary computations. Building upon this, we propose a literal reorder strategy designed to maximize the likelihood of early exits. This strategy is applied during a post-training, pre-inference stage through statistical analysis of all literals and the corresponding actions of their associated Tsetlin Automata (TA), introducing negligible runtime overhead. Experimental results using the gem5 simulator with an ARM processor show that our optimized implementation reduces inference time by up to 96.71% compared to conventional integer-based TM implementations while maintaining comparable code density.
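The AND-based early exit described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual code: it assumes literals are packed into 64-bit words and that each clause stores one include mask per word (a clause fires only if every included literal is 1). The names `clause_output`, `literals`, and `include` are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Evaluate one TM clause over bit-packed literals.
 * A clause is a conjunction (AND) of its included literals, so the
 * word-level test is (literals[w] & include[w]) == include[w].
 * Early exit: the first word where an included literal is 0 already
 * decides the clause, so the remaining words are skipped. */
int clause_output(const uint64_t *literals,
                  const uint64_t *include,
                  size_t n_words) {
    for (size_t w = 0; w < n_words; ++w) {
        if ((literals[w] & include[w]) != include[w])
            return 0; /* early exit: clause cannot fire */
    }
    return 1;
}
```

The reorder strategy then follows naturally: if literals that are statistically most likely to violate the clause (given the TA include actions observed after training) are packed into the earliest words, the loop tends to exit on the first iteration, which is the source of the reported speedup.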
Related papers
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
- An LLVM-Based Optimization Pipeline for SPDZ [0.0]
We implement a proof-of-concept LLVM-based optimization pipeline for the SPDZ protocol. Our front end accepts a subset of C with lightweight privacy annotations and lowers it to LLVM IR. Our back end performs data-flow and control-flow analysis on the optimized IR to drive a non-blocking runtime scheduler.
arXiv Detail & Related papers (2025-12-11T20:53:35Z)
- The Fast for the Curious: How to accelerate fault-tolerant quantum applications [101.46859364118622]
We evaluate strategies for reducing the run time of fault-tolerant quantum computations. We discuss how the co-design of hardware, fault tolerance, and algorithmic subroutines can reduce run times.
arXiv Detail & Related papers (2025-10-30T02:27:55Z)
- Pushing the Envelope of LLM Inference on AI-PC [45.081663877447816]
Ultra-low-bit models (1/1.58/2-bit) match the perplexity and end-task performance of their full-precision counterparts using the same model size. The computational efficiency of state-of-the-art inference runtimes (e.g. bitnet) used to deploy them remains underexplored. We take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency. We present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.
arXiv Detail & Related papers (2025-08-08T23:33:38Z)
- Efficient FPGA Implementation of Time-Domain Popcount for Low-Complexity Machine Learning [0.2663045001864042]
Population count (popcount) is a crucial operation for many low-complexity machine learning (ML) algorithms. We propose an innovative approach to accelerate and optimize these operations by performing them in the time domain.
arXiv Detail & Related papers (2025-05-04T16:44:15Z)
- Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
- APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs [81.5049387116454]
We introduce APB, an efficient long-context inference framework. APB uses multi-host approximate attention to enhance prefill speed. APB achieves speeds of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively.
arXiv Detail & Related papers (2025-02-17T17:59:56Z)
- LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference [10.608817382813786]
Mixed-precision general matrix multiplication (mpGEMM) is an important yet underexplored operation involving the multiplication of lower-precision weights with higher-precision activations. Off-the-shelf hardware does not support this operation, leading to indirect, thus inefficient, dequantization-based implementations. We study the lookup table (LUT)-based approach for mpGEMM and find that a conventional LUT implementation fails to achieve the promised gains.
arXiv Detail & Related papers (2024-08-12T08:52:14Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.05657299071648]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings. We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm. UIO-LLMs successfully handle long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- Fast DistilBERT on CPUs [13.29188219884869]
Transformer-based language models have become the standard approach to solving natural language processing tasks.
Industry adoption usually requires the maximum throughput to comply with certain latency constraints.
We propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators.
arXiv Detail & Related papers (2022-10-27T07:22:50Z) - MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.