From Obfuscated to Obvious: A Comprehensive JavaScript Deobfuscation Tool for Security Analysis
- URL: http://arxiv.org/abs/2512.14070v1
- Date: Tue, 16 Dec 2025 04:13:09 GMT
- Title: From Obfuscated to Obvious: A Comprehensive JavaScript Deobfuscation Tool for Security Analysis
- Authors: Dongchao Zhou, Lingyun Ying, Huajun Chai, Dongbin Wang,
- Abstract summary: JSIMPLIFIER is a comprehensive deobfuscation tool using a multi-stage pipeline with preprocessing.<n>We construct and release the largest real-world obfuscated JavaScript dataset with 44,421 samples.<n>Our results advance benchmarks for JavaScript deobfuscation research and practical security applications.
- Score: 6.038443052154118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: JavaScript's widespread adoption has made it an attractive target for malicious attackers who employ sophisticated obfuscation techniques to conceal harmful code. Current deobfuscation tools suffer from critical limitations that severely restrict their practical effectiveness. Existing tools struggle with diverse input formats, address only specific obfuscation types, and produce cryptic output that impedes human analysis. To address these challenges, we present JSIMPLIFIER, a comprehensive deobfuscation tool using a multi-stage pipeline with preprocessing, abstract syntax tree-based static analysis, dynamic execution tracing, and Large Language Model (LLM)-enhanced identifier renaming. We also introduce multi-dimensional evaluation metrics that integrate control/data flow analysis, code simplification assessment, entropy measures and LLM-based readability assessments. We construct and release the largest real-world obfuscated JavaScript dataset with 44,421 samples (23,212 wild malicious + 21,209 benign samples). Evaluation shows JSIMPLIFIER outperforms existing tools with 100% processing capability across 20 obfuscation techniques, 100% correctness on evaluation subsets, 88.2% code complexity reduction, and over 4-fold readability improvement validated by multiple LLMs. Our results advance benchmarks for JavaScript deobfuscation research and practical security applications.
Related papers
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - Multi-Agent Taint Specification Extraction for Vulnerability Detection [49.27772068704498]
Static Application Security Testing (SAST) tools using taint analysis are widely viewed as providing higher-quality vulnerability detection results.<n>We present SemTaint, a multi-agent system that strategically combines the semantic understanding of Large Language Models (LLMs) with traditional static program analysis.<n>We integrate SemTaint with CodeQL, a state-of-the-art SAST tool, and demonstrate its effectiveness by detecting 106 of 162 vulnerabilities previously undetectable by CodeQL.
arXiv Detail & Related papers (2026-01-15T21:31:51Z) - OBsmith: Testing JavaScript Obfuscator using LLM-powered sketching [11.58496128577643]
JavaScript obfuscators are widely deployed to protect intellectual property and resist reverse engineering.<n>Existing evaluations measure resistance to deobfuscation, leaving the critical question of whether obfuscators preserve semantics unanswered.<n>We present OBsmith, a novel framework to systematically test JavaScript obfuscators using large language models.
arXiv Detail & Related papers (2025-10-11T07:02:42Z) - "Digital Camouflage": The LLVM Challenge in LLM-Based Malware Detection [0.0]
Large Language Models (LLMs) have emerged as promising tools for malware detection.<n>However, their reliability under adversarial compiler-level obfuscation is yet to be discovered.<n>This study empirically evaluate the robustness of three state-of-the-art LLMs against compiler-level obfuscation techniques.
arXiv Detail & Related papers (2025-09-20T12:47:36Z) - CASCADE: LLM-Powered JavaScript Deobfuscator at Google [1.7266435334810277]
Software obfuscation, particularly prevalent in JavaScript, hinders code comprehension and analysis.<n>This paper introduces CASCADE, a novel hybrid approach that integrates the advanced coding capabilities of Gemini with the deterministic transformation capabilities of a compiler.<n>CASCADE is already deployed in Google's production environment, demonstrating substantial improvements in JavaScript deobfuscation efficiency.
arXiv Detail & Related papers (2025-07-23T16:57:32Z) - JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation [34.88009582470047]
Large Language Models (LLMs) have recently shown promise in automating the deobfuscation process.<n>We present JsDeObsBench, a benchmark designed to rigorously evaluate the effectiveness of LLMs in the context of JS deobfuscation.
arXiv Detail & Related papers (2025-06-25T06:50:13Z) - Decompiling Smart Contracts with a Large Language Model [51.49197239479266]
Despite Etherscan's 78,047,845 smart contracts deployed on (as of May 26, 2025), a mere 767,520 ( 1%) are open source.<n>This opacity necessitates the automated semantic analysis of on-chain smart contract bytecode.<n>We introduce a pioneering decompilation pipeline that transforms bytecode into human-readable and semantically faithful Solidity code.
arXiv Detail & Related papers (2025-06-24T13:42:59Z) - ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox.<n>Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse.<n>In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency.
arXiv Detail & Related papers (2025-03-27T23:08:53Z) - ShadowCode: Towards (Automatic) External Prompt Injection Attack against Code LLMs [56.46702494338318]
This paper introduces a new attack paradigm: (automatic) external prompt injection against code-oriented large language models.<n>We propose ShadowCode, a simple yet effective method that automatically generates induced perturbations based on code simulation.<n>We evaluate our method across 13 distinct malicious objectives, generating 31 threat cases spanning three popular programming languages.
arXiv Detail & Related papers (2024-07-12T10:59:32Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.<n>First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics.<n>Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - FV8: A Forced Execution JavaScript Engine for Detecting Evasive Techniques [53.288368877654705]
FV8 is a modified V8 JavaScript engine designed to identify evasion techniques in JavaScript code.
It selectively enforces code execution on APIs that conditionally inject dynamic code.
It identifies 1,443 npm packages and 164 (82%) extensions containing at least one type of evasion.
arXiv Detail & Related papers (2024-05-21T19:54:19Z) - Cryptic Bytes: WebAssembly Obfuscation for Evading Cryptojacking Detection [0.0]
We present the most comprehensive evaluation of code obfuscation techniques for WebAssembly to date.
We obfuscate a diverse set of applications, including utilities, games, and crypto miners, using state-of-the-art obfuscation tools like Tigress and wasm-mutate.
Our dataset of over 20,000 obfuscated WebAssembly binaries and the emcc-obf tool publicly available to stimulate further research.
arXiv Detail & Related papers (2024-03-22T13:32:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.