Related papers: Sift or Get Off the PoC: Applying Information Retrieval to Vulnerability Research with SiftRank

Sift or Get Off the PoC: Applying Information Retrieval to Vulnerability Research with SiftRank

URL: http://arxiv.org/abs/2512.06155v1
Date: Fri, 05 Dec 2025 21:09:32 GMT
Title: Sift or Get Off the PoC: Applying Information Retrieval to Vulnerability Research with SiftRank
Authors: Caleb Gross,
Abstract summary: We present SiftRank, a ranking algorithm achieving O(n) complexity through three key mechanisms.<n>SiftRank operates directly on thousands of items, with each document evaluated across multiple randomized batches to mitigate inconsistent judgments.<n>We demonstrate practical effectiveness on N-day vulnerability analysis, successfully identifying a vulnerability-fixing function among 2,197 changed functions in a stripped binary firmware patch within 99 seconds at an inference cost of $0.82.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Security research is fundamentally a problem of resource constraint and consequent prioritization. There is simply too much attack surface and too little time and energy to spend analyzing it all. The most effective security researchers are often those who are most skilled at intuitively deciding which part of an expansive attack surface to investigate. We demonstrate that this problem of selecting the most promising option from among many possibilities can be reframed as an information retrieval problem, and solved using document ranking techniques with LLMs performing the heavy lifting as general-purpose rankers. We present SiftRank, a ranking algorithm achieving O(n) complexity through three key mechanisms: listwise ranking using an LLM to order documents in small batches of approximately 10 items at a time; inflection-based convergence detection that adaptively terminates ranking when score distributions have stabilized; and iterative refinement that progressively focuses ranking effort on the most relevant documents. Unlike existing reranking approaches that require a separate first-stage retrieval step to narrow datasets to approximately 100 candidates, SiftRank operates directly on thousands of items, with each document evaluated across multiple randomized batches to mitigate inconsistent judgments by an LLM. We demonstrate practical effectiveness on N-day vulnerability analysis, successfully identifying a vulnerability-fixing function among 2,197 changed functions in a stripped binary firmware patch within 99 seconds at an inference cost of $0.82. Our approach enables scalable security prioritization for problems that are generally constrained by manual analysis, requiring only standard LLM API access without specialized infrastructure, embedding, or domain-specific fine-tuning. An open-source implementation of SiftRank may be found at https://github.com/noperator/siftrank.

Related papers

The Vulnerability of LLM Rankers to Prompt Injection Attacks [40.03039307576983]
Large Language Models (LLMs) have emerged as powerful re-rankers.<n>Recent research has showed that simple prompt injections embedded within a candidate document can significantly alter an LLM's ranking decisions.
arXiv Detail & Related papers (2026-02-18T06:19:08Z)
Favia: Forensic Agent for Vulnerability-fix Identification and Analysis [5.43098755190303]
We propose Favia, a forensic, agent-based framework for vulnerability-fix identification.<n>Favia combines scalable candidate ranking with deep and iterative semantic reasoning.<n>We evaluate Favia on CVEVC, a large-scale dataset we made that comprises over 8 million commits from 3,708 real-world repositories.
arXiv Detail & Related papers (2026-02-13T00:51:22Z)
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance.<n>The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms.<n>We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z)
Think Broad, Act Narrow: CWE Identification with Multi-Agent Large Language Models [0.09208007322096533]
Machine learning and large language models (LLMs) for vulnerability detection have received significant attention in recent years.<n>We propose a novel multi-agent LLM approach to address the challenges of identifying security weaknesses (CWEs)<n>In the PrimeVul dataset, Step 1 correctly identifies the appropriate CWE in 40.9% of the studied vulnerable functions.
arXiv Detail & Related papers (2025-08-02T17:57:46Z)
Reasoning with LLMs for Zero-Shot Vulnerability Detection [0.9208007322096533]
We present textbfVulnSage, a comprehensive evaluation framework and a curated dataset from diverse, large-scale open-source system software projects.<n>The framework supports multi-granular analysis across function, file, and inter-function levels.<n>It employs four diverse zero-shot prompt strategies: Baseline, Chain-of-context, Think, and Think & verify.
arXiv Detail & Related papers (2025-03-22T23:59:17Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.<n>We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
Guiding Retrieval using LLM-based Listwise Rankers [15.3583908068962]
We propose an adaptation of an existing adaptive retrieval method that supports the listwise setting.<n>Specifically, our proposed algorithm merges results both from the initial ranking and feedback documents.<n>We demonstrate that our method can improve nDCG@10 by up to 13.23% and recall by 28.02%--all while keeping the total number of LLM inferences constant and overheads due to the adaptive process minimal.
arXiv Detail & Related papers (2025-01-15T22:23:53Z)
InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks [9.748438507132207]
Large language models (LLMs) possess extensive knowledge and question-answering capabilities.<n> cache-sharing methods are commonly employed to enhance efficiency by reusing cached states or responses for the same or similar inference requests.<n>We propose a novel timing-based side-channel attack to execute input theft in LLMs inference.
arXiv Detail & Related papers (2024-11-27T10:14:38Z)
Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales. We scale generative retrieval to millions of passages with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem. Given a query (e.g., a question), return the set of relevant documents from a large document corpus. We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.