Hardware Software Optimizations for Fast Model Recovery on Reconfigurable Architectures
- URL: http://arxiv.org/abs/2512.06113v1
- Date: Fri, 05 Dec 2025 19:38:34 GMT
- Title: Hardware Software Optimizations for Fast Model Recovery on Reconfigurable Architectures
- Authors: Bin Xu, Ayan Banerjee, Sandeep Gupta,
- Abstract summary: We present MERINDA, an FPGA-accelerated MR framework that restructures computation as a streaming dataflow pipeline. On representative MR workloads, MERINDA delivers up to 6.3x fewer cycles than an FPGA-based LTC baseline.
- Score: 4.058950730052848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model Recovery (MR) is a core primitive for physical AI and real-time digital twins, but GPUs often execute MR inefficiently due to iterative dependencies, kernel-launch overheads, underutilized memory bandwidth, and high data-movement latency. We present MERINDA, an FPGA-accelerated MR framework that restructures computation as a streaming dataflow pipeline. MERINDA exploits on-chip locality through BRAM tiling, fixed-point kernels, and the concurrent use of LUT fabric and carry-chain adders to expose fine-grained spatial parallelism while minimizing off-chip traffic. This hardware-aware formulation removes synchronization bottlenecks and sustains high throughput across the iterative updates in MR. On representative MR workloads, MERINDA delivers up to 6.3x fewer cycles than an FPGA-based LTC baseline, enabling real-time performance for time-critical physical systems.
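The abstract describes restructuring the MR update as a streaming, fixed-point dataflow rather than an iterative GPU kernel. Below is a minimal Python sketch of that idea; the Q4.12 fixed-point format, the linear state-update equation, and all numeric values are illustrative assumptions, not details taken from the paper.

```python
# Illustrative fixed-point streaming update, loosely modeled on the
# MERINDA abstract. The Q4.12 format and the update x <- x + W @ u are
# assumptions for illustration only.

FRAC_BITS = 12          # assumed Q4.12 fixed-point format
SCALE = 1 << FRAC_BITS

def to_fix(x: float) -> int:
    """Quantize a float to fixed point (round to nearest)."""
    return int(round(x * SCALE))

def fix_mul(a: int, b: int) -> int:
    """Fixed-point multiply: full-width product, then rescale."""
    return (a * b) >> FRAC_BITS

def stream_update(state, weights, inputs):
    """One streamed state update x <- x + W @ u, all in fixed point.
    On an FPGA each multiply-accumulate could map to LUT fabric and
    carry-chain adders operating in parallel; here we simply loop."""
    out = []
    for i, row in enumerate(weights):
        acc = state[i]
        for w, u in zip(row, inputs):
            acc += fix_mul(w, u)
        out.append(acc)
    return out

x = [to_fix(v) for v in (0.5, -0.25)]
W = [[to_fix(0.1), to_fix(0.2)], [to_fix(-0.3), to_fix(0.4)]]
u = [to_fix(1.0), to_fix(0.5)]
x_next = stream_update(x, W, u)
print([v / SCALE for v in x_next])
```

Keeping all operands in a narrow fixed-point format is what lets such an update stream through on-chip BRAM tiles without round trips to off-chip DRAM.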
Related papers
- Analyzing Latency Hiding and Parallelism in an MLIR-based AI Kernel Compiler [0.0]
AI kernel compilation for edge devices depends on the compiler's ability to exploit parallelism and hide memory latency. This paper reports a benchmark methodology and corresponding results for three compiler-controlled mechanisms in an MLIR-based compilation pipeline.
arXiv Detail & Related papers (2026-02-22T19:14:23Z) - Enabling Physical AI at the Edge: Hardware-Accelerated Recovery of System Dynamics [4.058950730052848]
MERINDA (Model Recovery in Reconfigurable Dynamic Architecture) is an FPGA-accelerated MR framework designed to make physical AI practical on resource-constrained devices. We show that MERINDA can bring accurate, explainable MR to the edge for real-time monitoring of autonomous systems.
arXiv Detail & Related papers (2025-12-29T04:51:51Z) - Model Recovery at the Edge under Resource Constraints for Physical AI [4.415937510184061]
We propose a novel FPGA-accelerated Model Recovery framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs.
arXiv Detail & Related papers (2025-12-01T23:54:23Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Trajectory-aware Shifted State Space Models for Online Video Super-Resolution [57.87099307245989]
This paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba). TS-Mamba first constructs trajectories within a video to select the most similar tokens from previous frames. Our TS-Mamba achieves state-of-the-art performance in most cases with over a 22.7% reduction in complexity (in MACs).
arXiv Detail & Related papers (2025-08-14T08:42:15Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational cost and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge [60.000984252907195]
Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency. Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. We propose the FastCar framework to accelerate the decoding phase of AR video generation by exploiting temporal redundancy.
arXiv Detail & Related papers (2025-05-17T05:00:39Z) - MINIMALIST: switched-capacitor circuits for efficient in-memory computation of gated recurrent units [0.4941855521192951]
Recurrent neural networks (RNNs) have been a long-standing candidate for processing temporal sequence data. Recent advances in training paradigms have now inspired new generations of efficient RNNs. We introduce a streamlined and hardware-compatible architecture based on minimal gated recurrent units (GRUs).
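To make the "minimal GRU" idea concrete, here is a sketch of one common minimal-GRU formulation, in which the update gate and candidate state depend only on the current input; this particular cell and its weights are assumptions for illustration and may differ from the paper's exact architecture.

```python
import math

# Sketch of a minimal GRU cell (assumed formulation): the gate and
# candidate are functions of the input alone, which removes the
# hidden-to-hidden matrix and keeps the recurrence cheap enough for
# compact analog/in-memory implementations.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def min_gru_step(h_prev, x, w_z, b_z, w_h, b_h):
    z = sigmoid(w_z * x + b_z)        # update gate: input-only
    h_tilde = w_h * x + b_h           # candidate state: input-only, linear
    return (1.0 - z) * h_prev + z * h_tilde  # convex blend with old state

# Run a short input sequence through the cell with made-up weights.
h = 0.0
for x in (1.0, -0.5, 0.25):
    h = min_gru_step(h, x, w_z=0.8, b_z=0.0, w_h=1.2, b_h=0.1)
print(h)
```

Because the only recurrent operation is the convex blend on the last line, each step needs just two multiplies involving the hidden state, a structure that maps naturally onto charge-sharing circuits.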
arXiv Detail & Related papers (2025-05-13T14:13:41Z) - TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval [16.65446281180872]
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments. We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
arXiv Detail & Related papers (2025-02-28T11:32:22Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs. At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
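The core idea behind LUT-quantized kernels is that each low-bit weight code indexes a small table of reconstruction values instead of being dequantized arithmetically. A minimal sketch of that lookup step follows; the 4-bit uniform codebook and the example values are assumptions for illustration, not FLUTE's actual codebook or API.

```python
# Sketch of lookup-table dequantization, the building block of
# LUT-quantized kernels such as FLUTE. The codebook here is an assumed
# uniform 4-bit table over [-1, 1]; real engines learn or calibrate it.

BITS = 4
TABLE = [-1.0 + i * (2.0 / 15) for i in range(1 << BITS)]  # assumed codebook

def dequant_dot(codes, activations):
    """Dot product of LUT-quantized weights with a float activation vector:
    each 4-bit code is replaced by its table entry on the fly."""
    return sum(TABLE[c] * a for c, a in zip(codes, activations))

codes = [0, 15, 8, 4]            # 4-bit weight codes
acts = [1.0, 2.0, -1.0, 0.5]     # float activations
print(dequant_dot(codes, acts))
```

On a GPU the table is small enough to live in registers or shared memory, so the lookup replaces a multiply-and-shift dequantization sequence in the inner loop of the GEMM.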
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - MF-NeRF: Memory Efficient NeRF with Mixed-Feature Hash Table [62.164549651134465]
We propose MF-NeRF, a memory-efficient NeRF framework that employs a Mixed-Feature hash table to improve memory efficiency and reduce training time while maintaining reconstruction quality.
Our experiments with state-of-the-art Instant-NGP, TensoRF, and DVGO, indicate our MF-NeRF could achieve the fastest training time on the same GPU hardware with similar or even higher reconstruction quality.
arXiv Detail & Related papers (2023-04-25T05:44:50Z) - Model-Architecture Co-Design for High Performance Temporal GNN Inference on FPGA [5.575293536755127]
Real-world applications require high performance inference on real-time streaming dynamic graphs.
We present a novel model-architecture co-design for inference in memory-based TGNNs on FPGAs.
We train our simplified models using knowledge distillation to ensure similar accuracy vis-à-vis the original model.
arXiv Detail & Related papers (2022-03-10T00:24:47Z) - Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA [3.4870723728779565]
Sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is one of the most expensive kernels in tensor computations.
This paper focuses on a multi-faceted memory system, which explores the spatial and temporal locality of the data structures of MTTKRP.
Our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems, respectively.
arXiv Detail & Related papers (2021-09-18T08:19:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.