From Flows to Words: Can Zero-/Few-Shot LLMs Detect Network Intrusions? A Grammar-Constrained, Calibrated Evaluation on UNSW-NB15
- URL: http://arxiv.org/abs/2510.17883v2
- Date: Sun, 26 Oct 2025 07:42:39 GMT
- Title: From Flows to Words: Can Zero-/Few-Shot LLMs Detect Network Intrusions? A Grammar-Constrained, Calibrated Evaluation on UNSW-NB15
- Authors: Mohammad Abdul Rehman, Syed Imad Ali Shah, Abbas Anwar, Noor Islam
- Abstract summary: Large Language Models (LLMs) can reason over natural-language inputs, but their role in intrusion detection without fine-tuning remains uncertain. This study evaluates a prompt-only approach by converting each network flow to a compact textual record and augmenting it with lightweight, domain-inspired flags. We compare zero-shot, instruction-guided, and few-shot prompting to strong neural baselines under identical splits, reporting accuracy, precision, recall, F1, and macro scores.
- Score: 0.41998444721319217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) can reason over natural-language inputs, but their role in intrusion detection without fine-tuning remains uncertain. This study evaluates a prompt-only approach on UNSW-NB15 by converting each network flow to a compact textual record and augmenting it with lightweight, domain-inspired boolean flags (asymmetry, burst rate, TTL irregularities, timer anomalies, rare service/state, short bursts). To reduce output drift and support measurement, the model is constrained to produce structured, grammar-valid responses, and a single decision threshold is calibrated on a small development split. We compare zero-shot, instruction-guided, and few-shot prompting to strong tabular and neural baselines under identical splits, reporting accuracy, precision, recall, F1, and macro scores. Empirically, unguided prompting is unreliable, while instructions plus flags substantially improve detection quality; adding calibrated scoring further stabilizes results. On a balanced subset of two hundred flows, a 7B instruction-tuned model with flags reaches macro-F1 near 0.78; a lighter 3B model with few-shot cues and calibration attains F1 near 0.68 on one thousand examples. As the evaluation set grows to two thousand flows, decision quality decreases, revealing sensitivity to coverage and prompting. Tabular baselines remain more stable and faster, yet the prompt-only pipeline requires no gradient training, produces readable artifacts, and adapts easily through instructions and flags. Contributions include a flow-to-text protocol with interpretable cues, a calibration method for thresholding, a systematic baseline comparison, and a reproducibility bundle with prompts, grammar, metrics, and figures.
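The abstract's pipeline has two programmable pieces outside the LLM itself: rendering each flow as a compact textual record with boolean flags, and calibrating a single decision threshold on a development split. A minimal sketch of what those pieces might look like is below; the field names, flag cutoffs, and the F1-maximizing calibration rule are illustrative assumptions, not the authors' exact protocol.

```python
def flow_flags(flow):
    """Derive lightweight, domain-inspired boolean flags from a flow dict.
    Cutoffs here are placeholders, not the paper's definitions."""
    return {
        "asymmetric": flow["src_bytes"] > 10 * max(flow["dst_bytes"], 1),
        "burst": flow["pkts"] / max(flow["duration"], 1e-6) > 1000,
        "ttl_irregular": flow["sttl"] not in (32, 64, 128, 255),
        "rare_service": flow["service"] in ("irc", "radius", "-"),
        "short_burst": flow["duration"] < 0.01 and flow["pkts"] > 5,
    }

def flow_to_text(flow):
    """Render one flow as a compact textual record for the prompt."""
    flags = [name for name, on in flow_flags(flow).items() if on]
    base = (f"proto={flow['proto']} service={flow['service']} "
            f"dur={flow['duration']:.3f}s pkts={flow['pkts']} "
            f"bytes={flow['src_bytes']}/{flow['dst_bytes']}")
    return base + (" flags=" + ",".join(flags) if flags else " flags=none")

def calibrate_threshold(scores, labels):
    """Pick the single score threshold maximizing F1 on a small dev split."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Example: a short, highly asymmetric TCP burst trips every flag.
flow = {"proto": "tcp", "service": "-", "duration": 0.004, "pkts": 8,
        "src_bytes": 4200, "dst_bytes": 60, "sttl": 254}
print(flow_to_text(flow))
```

The textual record plus flags would be embedded in the zero-/few-shot prompt, and the model's structured score compared against the calibrated threshold at decision time.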
Related papers
- Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. We propose a Markov-informed score calibration strategy that models two relationships of context detection scores that may aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z)
- Residual Context Diffusion Language Models [90.07635240595926]
Residual Context Diffusion (RCD) is a module that converts discarded token representations into contextual residuals and injects them back for the next denoising step. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computational overhead.
arXiv Detail & Related papers (2026-01-30T13:16:32Z)
- SpatialBench-UC: Uncertainty-Aware Evaluation of Spatial Prompt Following in Text-to-Image Generation [0.0]
SpatialBench-UC is a small, reproducible benchmark for pairwise spatial relations. We release a benchmark package, versioned prompts, pinned configs, per-sample checker outputs, and report tables. We evaluate three baselines: Stable Diffusion 1.5, SD 1.5 BoxDiff, and SD 1.4 GLIGEN.
arXiv Detail & Related papers (2026-01-19T23:37:10Z)
- Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation [9.099589602551573]
We study universal adversarial suffixes that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence.
arXiv Detail & Related papers (2025-12-09T00:03:39Z)
- PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise [60.63315470285562]
MiniTruePrefixes is a novel specialized model that better detects factual inconsistencies over text prefixes. We show that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization.
arXiv Detail & Related papers (2025-11-03T09:07:44Z)
- Optimal Detection for Language Watermarks with Pseudorandom Collision [28.84134119819056]
We introduce a statistical framework that captures structure through a hierarchical two-layer partition. At its core is the concept of minimal units, the smallest groups treatable as independent across units while permitting dependence within. Applied to Gumbel-max and inverse-transform watermarks, our framework produces closed-form optimal rules.
arXiv Detail & Related papers (2025-10-24T20:21:52Z)
- Reliable Active Learning from Unreliable Labels via Neural Collapse Geometry [5.1511135538176]
Active Learning (AL) promises to reduce annotation cost by prioritizing informative samples, yet its reliability is undermined when labels are noisy or when the data distribution shifts. We propose Active Learning via Neural Collapse Geometry (NCAL-R), a framework that leverages the emergent geometric regularities of deep networks to counteract unreliable supervision.
arXiv Detail & Related papers (2025-10-10T17:50:31Z)
- Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B [1.948261185683419]
We investigate whether "evaluation scent" inflates measured performance without commensurate capability gains. We run six paired A/B scenarios that hold task content and decoding fixed while varying framing. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts) and practical guidance.
arXiv Detail & Related papers (2025-10-08T09:49:05Z)
- CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z)
- SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds [0.0]
We propose SoftPQ, a flexible and interpretable instance segmentation metric. We show that SoftPQ captures meaningful differences in segmentation quality that existing metrics overlook.
arXiv Detail & Related papers (2025-05-17T22:08:33Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating constraints on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction [12.92060812931049]
Minor changes in a prompt can cause significant discrepancies in model performance. We propose Placeholding Parallel Prediction (P3), a novel approach that predicts token probabilities across multiple positions. Experiments show improved accuracy and up to a 98% reduction in the standard deviation across prompts.
arXiv Detail & Related papers (2025-04-04T04:39:51Z)
- Analysing Zero-Shot Readability-Controlled Sentence Simplification [54.09069745799918]
We investigate how different types of contextual information affect a model's ability to generate sentences with the desired readability. Results show that all tested models struggle to simplify sentences due to model limitations and characteristics of the source sentences. Our experiments also highlight the need for better automatic evaluation metrics tailored to RCTS.
arXiv Detail & Related papers (2024-09-30T12:36:25Z)
- Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff [49.75167556773752]
Blockwise self-attentional encoder models have emerged as one promising end-to-end approach to simultaneous speech translation.
We propose a modified incremental blockwise beam search incorporating local agreement or hold-$n$ policies for quality-latency control.
arXiv Detail & Related papers (2023-09-20T14:59:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.