A Soundness and Precision Benchmark for Java Debloating Tools
- URL: http://arxiv.org/abs/2510.20679v1
- Date: Thu, 23 Oct 2025 15:52:20 GMT
- Title: A Soundness and Precision Benchmark for Java Debloating Tools
- Authors: Jonas Klauke, Tom Ohlmer, Stefan Schott, Serena Elisa Ponta, Wolfram Fischer, Eric Bodden
- Abstract summary: Deblometer is a micro-benchmark consisting of 59 test cases designed to assess support for various Java language features in debloating tools. We evaluate three popular Java debloating tools: Deptrim, JShrink, and ProGuard. Our evaluation reveals that all tools remove required program constructs, which results in changed semantics or execution crashes.
- Score: 2.206395510692683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern software development reuses code by importing libraries as dependencies. Software projects typically include an average of 36 dependencies, with 80% being transitive, meaning they are dependencies of dependencies. Recent research indicates that only 24.9% of these dependencies are required at runtime, and even within those, many program constructs remain unused, adding unnecessary code to the project. This has led to the development of debloating tools that remove unnecessary dependencies and program constructs while balancing precision by eliminating unused constructs and soundness by preserving all required constructs. To systematically evaluate this trade-off, we developed Deblometer, a micro-benchmark consisting of 59 test cases designed to assess support for various Java language features in debloating tools. Each test case includes a manually curated ground truth specifying necessary and bloated classes, methods, and fields, enabling precise measurement of soundness and precision. Using Deblometer, we evaluated three popular Java debloating tools: Deptrim, JShrink, and ProGuard. Our evaluation reveals that all tools remove required program constructs, which results in changed semantics or execution crashes. In particular, the dynamic class loading feature introduces unsoundness in all evaluated tools. Our comparison shows that Deptrim retains more bloated constructs, while ProGuard removes more required constructs. JShrink's soundness is significantly affected by limited support for annotations, which leads to corrupted debloated artifacts. These soundness issues highlight the need to improve debloating tools to ensure stable and reliable debloated software.
Related papers
- From Separate Compilation to Sound Language Composition [7.697692044735504]
This work introduces nlgcheck, a theoretically sound static analysis tool based on data-flow analysis for the Neverlang language workbench. nlgcheck detects potential runtime errors -- such as undefined attribute accesses -- at compile time, preserving separate compilation while maintaining strong static correctness guarantees.
arXiv Detail & Related papers (2026-02-03T17:38:34Z)
- D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning [49.16469288280772]
Decompilers reconstruct human-readable source code from binaries. Despite recent advances, their outputs often suffer from syntactic and semantic errors and remain difficult to read. With the advent of large language models (LLMs), researchers began to explore the potential of LLMs to refine decompiler output. We present D-LIFT, an enhanced decompiler-LLM pipeline with fine-tuned reinforcement learning.
arXiv Detail & Related papers (2025-06-11T19:09:08Z)
- ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
- Automatic Build Repair for Test Cases using Incompatible Java Versions [7.4881561767138365]
We introduce an approach to repair test cases of Java projects by performing dependency minimization.
Unlike existing state-of-the-art techniques, our approach performs at source-level, which allows compile-time errors to be fixed.
arXiv Detail & Related papers (2024-04-27T07:55:52Z)
- Detecting Build Dependency Errors in Incremental Builds [13.823208277774572]
We propose EChecker to detect build dependency errors in the context of incremental builds.
EChecker automatically updates actual build dependencies by inferring them from C/C++ pre-processor directives and Makefile changes from new commits.
EChecker increases the build dependency error detection efficiency by an average of 85.14 times.
arXiv Detail & Related papers (2024-04-20T07:01:11Z)
- StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models [74.88844320554284]
We introduce StableToolBench, a benchmark evolving from ToolBench. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. The stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation.
arXiv Detail & Related papers (2024-03-12T14:57:40Z)
- ReGAL: Refactoring Programs to Discover Generalizable Abstractions [59.05769810380928]
Generalizable Abstraction Learning (ReGAL) is a method for learning a library of reusable functions via codeization.
We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains.
For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains.
arXiv Detail & Related papers (2024-01-29T18:45:30Z)
- A Broad Comparative Evaluation of Software Debloating Tools [3.0913520619484287]
Software debloating tools seek to improve program security and performance by removing unnecessary code, called bloat.
We surveyed 10 years of debloating literature and several tools currently under commercial development to taxonomize knowledge about the debloating ecosystem.
Our evaluation, conducted on a diverse set of 20 benchmark programs, measures tools across 12 performance, security, and correctness metrics.
arXiv Detail & Related papers (2023-12-20T18:53:18Z)
- On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potential vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub.
We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch.
We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
arXiv Detail & Related papers (2021-04-16T05:27:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.