Interval Analysis in Industrial-Scale BMC Software Verifiers: A Case Study
- URL: http://arxiv.org/abs/2406.15281v1
- Date: Fri, 21 Jun 2024 16:18:57 GMT
- Title: Interval Analysis in Industrial-Scale BMC Software Verifiers: A Case Study
- Authors: Rafael Sá Menezes, Edoardo Manino, Fedor Shmarov, Mohannad Aldughaim, Rosiane de Freitas, Lucas C. Cordeiro
- Abstract summary: We evaluate whether the computational cost of interval analysis yields significant enough improvements in BMC's performance to justify its use.
Our results show that interval analysis is essential in solving 203 unique benchmarks.
- Score: 4.024189528766689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bounded Model Checking (BMC) is a widely used software verification technique. Despite its successes, the technique has several limiting factors, from state-space explosion to lack of completeness. Over the years, interval analysis has repeatedly been proposed as a partial solution to these limitations. In this work, we evaluate whether the computational cost of interval analysis yields significant enough improvements in BMC's performance to justify its use. In more detail, we quantify the benefits of interval analysis on two benchmarks: the Intel Core Power Management firmware and 9537 programs in the ReachSafety category of the International Competition on Software Verification. Our results show that interval analysis is essential in solving 203 unique benchmarks.
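The abstract describes interval analysis as an auxiliary static analysis that can discharge properties before (or alongside) BMC's expensive path unrolling. The sketch below is a minimal, hypothetical interval domain in Python to illustrate the idea; the `Interval` class and the example program bounds are illustrative assumptions, not code from ESBMC or the paper.

```python
# Minimal sketch of interval analysis, the abstract domain the paper
# evaluates inside BMC verifiers. Everything here is illustrative.

class Interval:
    """A closed integer interval [lo, hi]."""
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # [a,b] + [c,d] = [a+c, b+d]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # [a,b] * [c,d]: min/max over all four corner products
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))

    def contains(self, value):
        return self.lo <= value <= self.hi

# Suppose the program under verification computes z = x * y + x, where
# earlier analysis bounds x to [0, 10] and y to [1, 3]. The interval
# result alone proves the assertion z < 100, so the model checker never
# needs to unroll paths or query the SAT/SMT solver for this property.
x = Interval(0, 10)
y = Interval(1, 3)
z = x * y + x
print(z.lo, z.hi)  # prints: 0 40
assert z.hi < 100  # assertion discharged by the abstract domain alone
```

Real implementations additionally handle division, widening for loops, and contraction from branch conditions, but the cost/benefit question the paper studies is already visible here: the analysis adds work per expression in exchange for pruning the solver's search space.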
Related papers
- Leveraging Slither and Interval Analysis to build a Static Analysis Tool [0.0]
This paper presents our progress toward finding defects that state-of-the-art analysis tools sometimes miss or detect only partially.
We developed a working solution built on top of Slither that uses interval analysis to evaluate the contract state during the execution of each instruction.
arXiv Detail & Related papers (2024-10-31T09:28:09Z)
- Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC).
We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions.
This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
arXiv Detail & Related papers (2024-10-02T18:17:54Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- ESBMC v7.4: Harnessing the Power of Intervals [4.6232063855710654]
ESBMC implements many state-of-the-art techniques for model checking.
New features allow us to obtain verification results for unsupported programs and properties.
ESBMC employs a new interval analysis of expressions in programs to increase verification performance.
arXiv Detail & Related papers (2023-12-22T14:56:46Z)
- Formal Runtime Error Detection During Development in the Automotive Industry [0.1611401281366893]
For safety-relevant automotive software, it is recommended to use sound static program analysis to prove the absence of runtime errors.
The analysis is often perceived as burdensome by developers because it runs for a long time and produces many false alarms.
In this case study, we present how automatically inferred contracts add context to module-level analysis.
arXiv Detail & Related papers (2023-10-25T08:44:52Z)
- Timing Analysis of Embedded Software Updates [1.7027593388928293]
We present RETA, a differential timing analysis technique to verify the impact of an update on the execution time of embedded software.
We adapt RETA for integration into aiT, an industrial timing analysis tool, and also develop a complete implementation in a tool called DELTA.
We show that RETA decreases aiT's analysis time by 45% and its memory consumption by 8.9%, whereas removing RETA from DELTA, effectively rendering it a regular timing analysis tool, increases its analysis time by 27%.
arXiv Detail & Related papers (2023-04-27T14:19:15Z)
- Interpolation and SAT-Based Model Checking Revisited: Adoption to Software Verification [3.9403985159394144]
A formal verification algorithm was originally devised to verify safety properties of finite-state transition systems.
Although 20 years old, the algorithm is still state-of-the-art in hardware model checking.
Our contribution is to close this significant, two-decade-old gap in knowledge by adopting the algorithm to software verification.
arXiv Detail & Related papers (2022-08-09T21:33:19Z)
- Probabilistically Robust Learning: Balancing Average- and Worst-case Performance [105.87195436925722]
We propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case.
From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning.
arXiv Detail & Related papers (2022-02-02T17:01:38Z)
- MQBench: Towards Reproducible and Deployable Model Quantization Benchmark [53.12623958951738]
MQBench is a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms.
We choose multiple platforms for real-world deployments, including CPU, GPU, ASIC, and DSP, and evaluate extensive state-of-the-art quantization algorithms.
We conduct a comprehensive analysis and find considerable intuitive or counter-intuitive insights.
arXiv Detail & Related papers (2021-11-05T23:38:44Z)
- Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization [69.07420650261649]
We introduce a novel, simple, and powerful contrastive MI estimator named FLO.
Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently.
The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
arXiv Detail & Related papers (2021-07-02T15:20:41Z)
- Approximation Algorithms for Sparse Principal Component Analysis [57.5357874512594]
Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and statistics.
Various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis.
We present thresholding as a provably accurate, polynomial-time approximation algorithm for the SPCA problem.
arXiv Detail & Related papers (2020-06-23T04:25:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.