RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning
- URL: http://arxiv.org/abs/2512.09829v1
- Date: Wed, 10 Dec 2025 17:07:19 GMT
- Title: RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning
- Authors: Khurram Khalil, Muhammad Mahad Khaliq, Khaza Anuarul Hoque,
- Abstract summary: RIFT (Reinforcement Learning-guided Intelligent Fault Targeting) is a scalable framework that automates the discovery of minimal, high-impact fault scenarios.<n>RIFT transforms the complex search for worst-case faults into a sequential decision-making problem.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a \textbf{2.2$\times$} fault assessment speedup over evolutionary methods and reduces the required test vector volume by over \textbf{99\%} compared to random fault injection, all while achieving \textbf{superior fault coverage}. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a \textbf{12.8$\times$} improvement in \textbf{cost-effectiveness} (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.
Related papers
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimize the accuracy-efficiency trade-off via principled resource allocation.<n>We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation [72.78362530982109]
ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, is a framework that decouples exploration from commitment.<n>We show that naive LLM-based simulators struggle to capture rare but high-impact failure modes.<n>We introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions.
arXiv Detail & Related papers (2026-02-02T06:33:22Z) - CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.<n>Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.<n>We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z) - Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications.<n>We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z) - Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models.<n>We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution.<n>We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
arXiv Detail & Related papers (2025-11-20T04:11:44Z) - White-Basilisk: A Hybrid Model for Code Vulnerability Detection [45.03594130075282]
We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance.<n>White-Basilisk achieves results in vulnerability detection tasks with a parameter count of only 200M.<n>This research establishes new benchmarks in code security and provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks.
arXiv Detail & Related papers (2025-07-11T12:39:25Z) - Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning [45.10424242207931]
Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs)<n>We introduce a novel method ReasonRAG that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for query generation, evidence extraction, and answer generation.<n>With the process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers.
arXiv Detail & Related papers (2025-05-20T08:21:00Z) - A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection [13.109309606764754]
We introduce a plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself.<n>Our method achieves state-of-the-art detection performance with negligible computational overhead.
arXiv Detail & Related papers (2025-05-19T00:48:53Z) - Enhancing Smart Contract Vulnerability Detection in DApps Leveraging Fine-Tuned LLM [0.7018579932647147]
Decentralized applications (DApps) face significant security risks due to vulnerabilities in smart contracts.<n>This paper proposes a novel approach leveraging fine-tuned Large Language Models (LLMs) to enhance smart contract vulnerability detection.
arXiv Detail & Related papers (2025-04-07T12:32:14Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery [25.33005185616769]
Robust feed cloud completion (RMC) is a widely used machine learning tool.<n>It simultaneously tackles two critical issues in low-rank data analysis: missing data and extreme outliers.<n>This paper proposes Robust Matrix Learned (LRMC) for largescale RMC problems.<n>The superior empirical performance of LRMC is verified with experiments against state-of-the-art on synthetic datasets and real applications.
arXiv Detail & Related papers (2024-12-31T23:22:12Z) - Towards Resource-Efficient Federated Learning in Industrial IoT for Multivariate Time Series Analysis [50.18156030818883]
Anomaly and missing data constitute a thorny problem in industrial applications.
Deep learning enabled anomaly detection has emerged as a critical direction.
The data collected in edge devices contain user privacy.
arXiv Detail & Related papers (2024-11-06T15:38:31Z) - Feature Attenuation of Defective Representation Can Resolve Incomplete Masking on Anomaly Detection [1.0358639819750703]
In unsupervised anomaly detection (UAD) research, it is necessary to develop a computationally efficient and scalable solution.
We revisit the reconstruction-by-inpainting approach and rethink to improve it by analyzing strengths and weaknesses.
We propose Feature Attenuation of Defective Representation (FADeR) that only employs two layers which attenuates feature information of anomaly reconstruction.
arXiv Detail & Related papers (2024-07-05T15:44:53Z) - Neural Fault Injection: Generating Software Faults from Natural Language [6.050976240234865]
This paper introduces a novel methodology that harnesses the capabilities of Large Language Models (LLMs) augmented with Reinforcement Learning from Human Feedback (RLHF)
The usage of RLHF emphasizes an iterative refinement process, allowing testers to provide feedback on generated faults.
This innovative methodology aims to significantly reduce the manual effort involved in crafting fault scenarios as it allows testers to focus on higher-level testing strategies.
arXiv Detail & Related papers (2024-04-11T05:59:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.