Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs
- URL: http://arxiv.org/abs/2602.00513v1
- Date: Sat, 31 Jan 2026 05:01:01 GMT
- Title: Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs
- Authors: Md Tanvirul Alam, Aritran Piplai, Ionut Cardei, Nidhi Rastogi, Peter J Worth
- Abstract summary: Large language models (LLMs) show promise for converting noisy, unstructured security artifacts into automation-ready representations. CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We propose a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model.
- Score: 2.702990676892003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Experiments across LLM backbones show consistent improvements in accuracy and robustness over SFT across multiple benchmarks.
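The abstract's key observation is that CTI identifiers follow canonical schemas, so model outputs can be verified deterministically and used as a reward signal. The following is a minimal illustrative sketch of that idea, not the paper's actual verifier: it assumes the model emits a JSON object with a `"techniques"` list of MITRE ATT&CK-style technique IDs (e.g. `T1059` or `T1059.001`), checks the output against that schema, and scores the identifier predictions against a gold set. The function name, JSON layout, and F1-based scoring are all assumptions made for illustration.

```python
import json
import re

# Canonical pattern for ATT&CK-style technique IDs: "T" + 4 digits,
# optionally followed by a 3-digit sub-technique suffix.
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")

def verify_cti_output(raw_output: str, gold_ids: set) -> float:
    """Return a verifiable reward in [0, 1] for one model rollout.

    0.0 if the output is not valid JSON or violates the schema;
    otherwise the F1 overlap between predicted and gold technique IDs.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    predicted = parsed.get("techniques") if isinstance(parsed, dict) else None
    if not isinstance(predicted, list):
        return 0.0
    # Schema check: every identifier must match the canonical pattern.
    if not all(isinstance(t, str) and TECHNIQUE_ID.match(t) for t in predicted):
        return 0.0
    pred_set = set(predicted)
    if not pred_set and not gold_ids:
        return 1.0
    tp = len(pred_set & gold_ids)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_ids)
    return 2 * precision * recall / (precision + recall)
```

A reward of this shape is sparse by construction (any schema violation scores zero), which is the failure mode the paper's self-training mechanism is designed to mitigate: rollouts that do pass the verifier are retained and distilled back into the model.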
Related papers
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs [48.83701310501069]
Large language models (LLMs) have achieved notable performance in code synthesis. We introduce a performance-aware, closed-loop solution that enables LLMs to autonomously engineer optimal transformations. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions.
arXiv Detail & Related papers (2026-01-07T11:13:02Z) - Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMs [38.3239023969819]
Large Language Models (LLMs) have emerged as powerful tools for diverse applications. We identify and propose a novel class of vulnerabilities, termed Tool-Completion Attack (TCA). We introduce Context-Aware Hierarchical Learning (CAHL) to address these vulnerabilities.
arXiv Detail & Related papers (2025-12-03T12:10:21Z) - Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach [1.332091725929965]
This paper presents a deep reinforcement learning framework that combines a Variational Autoencoder (VAE), an LSTM-based Deep Q-Network (DQN), dynamic reward shaping, and an active learning module to address these issues in a unified learning framework.
arXiv Detail & Related papers (2025-11-15T20:36:20Z) - CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks. We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z) - RationAnomaly: Log Anomaly Detection with Rationality via Chain-of-Thought and Reinforcement Learning [27.235259453535537]
RationAnomaly is a novel framework that enhances log anomaly detection by synergizing Chain-of-Thought fine-tuning with reinforcement learning. We have released the corresponding resources, including code and datasets.
arXiv Detail & Related papers (2025-09-18T07:35:58Z) - LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification [7.608817324043705]
We propose LRCTI, a framework designed for multi-step Cyber Threat Intelligence credibility verification. The framework first employs a text summarization module to distill complex intelligence reports into concise and actionable threat claims. It then uses an adaptive multi-step evidence retrieval mechanism that iteratively identifies and refines supporting information from a CTI-specific corpus. Experiments conducted on two benchmark datasets, CTI-200 and PolitiFact, show that LRCTI improves F1-Macro and F1-Micro scores by over 5%, reaching 90.9% and 93.6%, respectively.
arXiv Detail & Related papers (2025-07-15T13:42:32Z) - Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction [20.1863553357121]
Current deep learning architectures for remote sensing are fundamentally rigid. We introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands. It unifies various dense prediction tasks and diverse semantic class predictions.
arXiv Detail & Related papers (2025-05-18T07:39:17Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - AttackSeqBench: Benchmarking Large Language Models in Analyzing Attack Sequences within Cyber Threat Intelligence [17.234214109636113]
Cyber Threat Intelligence (CTI) reports document observations of cyber threats, synthesizing evidence about adversaries' actions and intent into actionable knowledge. The unstructured and verbose nature of CTI reports poses significant challenges for security practitioners to manually extract and analyze such sequences. Although large language models (LLMs) exhibit promise in cybersecurity tasks such as entity extraction and knowledge graph construction, their understanding and reasoning capabilities towards behavioral sequences remain underexplored.
arXiv Detail & Related papers (2025-03-05T04:25:21Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - Meta-learning framework with applications to zero-shot time-series forecasting [82.61728230984099]
This work provides positive evidence using a broad meta-learning framework, in which residual connections act as a meta-learning adaptation mechanism.
We show that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining.
arXiv Detail & Related papers (2020-02-07T16:39:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.