Hardware Software Optimizations for Fast Model Recovery on Reconfigurable Architectures
- URL: http://arxiv.org/abs/2512.06113v1
- Date: Fri, 05 Dec 2025 19:38:34 GMT
- Title: Hardware Software Optimizations for Fast Model Recovery on Reconfigurable Architectures
- Authors: Bin Xu, Ayan Banerjee, Sandeep Gupta,
- Abstract summary: We present MERINDA, an FPGA-accelerated MR framework that restructures computation as a streaming dataflow pipeline. On representative MR workloads, MERINDA delivers up to 6.3x fewer cycles than an FPGA-based LTC baseline.
- Score: 4.058950730052848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model Recovery (MR) is a core primitive for physical AI and real-time digital twins, but GPUs often execute MR inefficiently due to iterative dependencies, kernel-launch overheads, underutilized memory bandwidth, and high data-movement latency. We present MERINDA, an FPGA-accelerated MR framework that restructures computation as a streaming dataflow pipeline. MERINDA exploits on-chip locality through BRAM tiling, fixed-point kernels, and the concurrent use of LUT fabric and carry-chain adders to expose fine-grained spatial parallelism while minimizing off-chip traffic. This hardware-aware formulation removes synchronization bottlenecks and sustains high throughput across the iterative updates in MR. On representative MR workloads, MERINDA delivers up to 6.3x fewer cycles than an FPGA-based LTC baseline, enabling real-time performance for time-critical physical systems.
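The abstract describes restructuring the MR update as a streaming, fixed-point dataflow rather than an iterative GPU kernel. Below is a minimal Python sketch of that idea; the Q4.12 fixed-point format, the linear state-update equation, and all numeric values are illustrative assumptions, not details taken from the paper.

```python
# Illustrative fixed-point streaming update, loosely modeled on the
# MERINDA abstract. The Q4.12 format and the update x <- x + W @ u are
# assumptions for illustration only.

FRAC_BITS = 12          # assumed Q4.12 fixed-point format
SCALE = 1 << FRAC_BITS

def to_fix(x: float) -> int:
    """Quantize a float to fixed point (round to nearest)."""
    return int(round(x * SCALE))

def fix_mul(a: int, b: int) -> int:
    """Fixed-point multiply: full-width product, then rescale."""
    return (a * b) >> FRAC_BITS

def stream_update(state, weights, inputs):
    """One streamed state update x <- x + W @ u, all in fixed point.
    On an FPGA each multiply-accumulate could map to LUT fabric and
    carry-chain adders operating in parallel; here we simply loop."""
    out = []
    for i, row in enumerate(weights):
        acc = state[i]
        for w, u in zip(row, inputs):
            acc += fix_mul(w, u)
        out.append(acc)
    return out

x = [to_fix(v) for v in (0.5, -0.25)]
W = [[to_fix(0.1), to_fix(0.2)], [to_fix(-0.3), to_fix(0.4)]]
u = [to_fix(1.0), to_fix(0.5)]
x_next = stream_update(x, W, u)
print([v / SCALE for v in x_next])
```

Keeping all operands in a narrow fixed-point format is what lets such an update stream through on-chip BRAM tiles without round trips to off-chip DRAM.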
Related papers
- Analyzing Latency Hiding and Parallelism in an MLIR-based AI Kernel Compiler [0.0]
AI kernel compilation for edge devices depends on the compiler's ability to exploit parallelism and hide memory latency. This paper reports a benchmark methodology and corresponding results for three compiler-controlled mechanisms in an MLIR-based compilation pipeline.
arXiv Detail & Related papers (2026-02-22T19:14:23Z) - Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics [4.058950730052848]
MERINDA (Model Recovery in Reconfigurable Dynamic Architecture) is an FPGA-accelerated MR framework designed to make physical AI practical on resource-constrained devices. We show that MERINDA can bring accurate, explainable MR to the edge for real-time monitoring of autonomous systems.
arXiv Detail & Related papers (2025-12-29T04:51:51Z) - Model Recovery at the Edge under Resource Constraints for Physical AI [4.415937510184061]
We propose a novel FPGA-accelerated Model Recovery framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs.
arXiv Detail & Related papers (2025-12-01T23:54:23Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Trajectory-aware Shifted State Space Models for Online Video Super-Resolution [57.87099307245989]
This paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba). TS-Mamba first constructs trajectories within a video to select the most similar tokens from previous frames. Our TS-Mamba achieves state-of-the-art performance in most cases with over a 22.7% reduction in complexity (in MACs).
arXiv Detail & Related papers (2025-08-14T08:42:15Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational cost and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge [60.000984252907195]
Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency. Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. We propose the FastCar framework to accelerate the decoding phase of AR video generation by exploiting temporal redundancy.
arXiv Detail & Related papers (2025-05-17T05:00:39Z) - MINIMALIST: switched-capacitor circuits for efficient in-memory computation of gated recurrent units [0.4941855521192951]
Recurrent neural networks (RNNs) have been a long-standing candidate for processing temporal sequence data. Recent advances in training paradigms have now inspired new generations of efficient RNNs. We introduce a streamlined and hardware-compatible architecture based on minimal gated recurrent units (GRUs).
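To make the "minimal GRU" idea concrete, here is a sketch of one common minimal-GRU formulation, in which the update gate and candidate state depend only on the current input; this particular cell and its weights are assumptions for illustration and may differ from the paper's exact architecture.

```python
import math

# Sketch of a minimal GRU cell (assumed formulation): the gate and
# candidate are functions of the input alone, which removes the
# hidden-to-hidden matrix and keeps the recurrence cheap enough for
# compact analog/in-memory implementations.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def min_gru_step(h_prev, x, w_z, b_z, w_h, b_h):
    z = sigmoid(w_z * x + b_z)        # update gate: input-only
    h_tilde = w_h * x + b_h           # candidate state: input-only, linear
    return (1.0 - z) * h_prev + z * h_tilde  # convex blend with old state

# Run a short input sequence through the cell with made-up weights.
h = 0.0
for x in (1.0, -0.5, 0.25):
    h = min_gru_step(h, x, w_z=0.8, b_z=0.0, w_h=1.2, b_h=0.1)
print(h)
```

Because the only recurrent operation is the convex blend on the last line, each step needs just two multiplies involving the hidden state, a structure that maps naturally onto charge-sharing circuits.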
arXiv Detail & Related papers (2025-05-13T14:13:41Z) - TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval [16.65446281180872]
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments. We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
arXiv Detail & Related papers (2025-02-28T11:32:22Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
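The core idea behind LUT-quantized kernels is that each low-bit weight code indexes a small table of reconstruction values instead of being dequantized arithmetically. A minimal sketch of that lookup step follows; the 4-bit uniform codebook and the example values are assumptions for illustration, not FLUTE's actual codebook or API.

```python
# Sketch of lookup-table dequantization, the building block of
# LUT-quantized kernels such as FLUTE. The codebook here is an assumed
# uniform 4-bit table over [-1, 1]; real engines learn or calibrate it.

BITS = 4
TABLE = [-1.0 + i * (2.0 / 15) for i in range(1 << BITS)]  # assumed codebook

def dequant_dot(codes, activations):
    """Dot product of LUT-quantized weights with a float activation vector:
    each 4-bit code is replaced by its table entry on the fly."""
    return sum(TABLE[c] * a for c, a in zip(codes, activations))

codes = [0, 15, 8, 4]            # 4-bit weight codes
acts = [1.0, 2.0, -1.0, 0.5]     # float activations
print(dequant_dot(codes, acts))
```

On a GPU the table is small enough to live in registers or shared memory, so the lookup replaces a multiply-and-shift dequantization sequence in the inner loop of the GEMM.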
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - MF-NeRF: Memory Efficient NeRF with Mixed-Feature Hash Table [62.164549651134465]
We propose MF-NeRF, a memory-efficient NeRF framework that employs a Mixed-Feature hash table to improve memory efficiency and reduce training time while maintaining reconstruction quality.
Our experiments with state-of-the-art Instant-NGP, TensoRF, and DVGO, indicate our MF-NeRF could achieve the fastest training time on the same GPU hardware with similar or even higher reconstruction quality.
arXiv Detail & Related papers (2023-04-25T05:44:50Z) - Model-Architecture Co-Design for High Performance Temporal GNN Inference on FPGA [5.575293536755127]
Real-world applications require high performance inference on real-time streaming dynamic graphs.
We present a novel model-architecture co-design for inference in memory-based TGNNs on FPGAs.
We train our simplified models using knowledge distillation to ensure similar accuracy vis-à-vis the original model.
arXiv Detail & Related papers (2022-03-10T00:24:47Z) - Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA [3.4870723728779565]
Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is one of the most expensive kernels in tensor computations.
This paper focuses on a multi-faceted memory system, which explores the spatial and temporal locality of the data structures of MTTKRP.
Our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems, respectively.
arXiv Detail & Related papers (2021-09-18T08:19:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.