CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
- URL: http://arxiv.org/abs/2504.19093v1
- Date: Sun, 27 Apr 2025 03:41:17 GMT
- Title: CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
- Authors: Yu Li, Qizhi Pei, Mengyuan Sun, Honglin Lin, Chenlin Ming, Xin Gao, Jiang Wu, Conghui He, Lijun Wu
- Abstract summary: We introduce CipherBank, a benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) in cryptographic decryption tasks. We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, especially the recent advancements in reasoning, such as o1 and o3, pushing the boundaries of AI. Despite these impressive achievements in mathematics and coding, the reasoning abilities of LLMs in domains requiring cryptographic expertise remain underexplored. In this paper, we introduce CipherBank, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously crafted problems, covering 262 unique plaintexts across 5 domains and 14 subdomains, with a focus on privacy-sensitive and real-world scenarios that necessitate encryption. From a cryptographic perspective, CipherBank incorporates 3 major categories of encryption methods, spanning 9 distinct algorithms, ranging from classical ciphers to custom cryptographic techniques. We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results reveal significant gaps in reasoning abilities not only between general-purpose chat LLMs and reasoning-focused LLMs but also in the performance of current reasoning-focused models when applied to classical cryptographic decryption tasks, highlighting the challenges these models face in understanding and manipulating encrypted data. Through detailed analysis and error investigations, we provide several key observations that shed light on the limitations and potential improvement areas for LLMs in cryptographic reasoning. These findings underscore the need for continuous advancements in LLM reasoning capabilities.
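To make the task style concrete, the sketch below shows the shape of a classical-cipher decryption item: the model sees only a ciphertext and must recover the plaintext by inferring the scheme and key. The Vigenère cipher, key, and message here are illustrative assumptions, not items drawn from CipherBank.

```python
# A minimal sketch of a CipherBank-style decryption item (cipher choice, key,
# and message are illustrative assumptions, not benchmark data).
import string

ALPHABET = string.ascii_uppercase

def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Encrypt or decrypt uppercase A-Z text with a repeating-key Vigenere cipher."""
    sign = -1 if decrypt else 1
    out = []
    for i, ch in enumerate(text):
        shift = ALPHABET.index(key[i % len(key)])
        out.append(ALPHABET[(ALPHABET.index(ch) + sign * shift) % 26])
    return "".join(out)

plaintext = "TRANSFERTHEFUNDSONMONDAY"   # made-up privacy-sensitive message
ciphertext = vigenere(plaintext, key="BANK")

# A benchmark item would present only `ciphertext` and ask the model to
# recover `plaintext`.
assert vigenere(ciphertext, key="BANK", decrypt=True) == plaintext
print(ciphertext)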
Related papers
- An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z)
- The Code Barrier: What LLMs Actually Understand? [7.407441962359689]
This research uses code obfuscation as a structured testing framework to evaluate semantic understanding capabilities of language models. Findings show a statistically significant performance decline as obfuscation complexity increases. This research introduces a new evaluation approach for assessing code comprehension in language models.
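A toy sketch of the obfuscation idea, assuming a simple semantics-preserving transformation (identifier renaming); the paper's framework and obfuscation levels are more elaborate than this.

```python
# A toy semantics-preserving obfuscation: rename user-defined identifiers to
# opaque symbols. Behavior is unchanged, but the surface cues an LLM might
# rely on are gone. Requires Python 3.9+ for ast.unparse.
import ast

SOURCE = """
def average(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)
"""

class Renamer(ast.NodeTransformer):
    """Map every user-defined name to an opaque identifier (v0, v1, ...)."""
    def __init__(self):
        self.mapping = {}

    def _alias(self, name: str) -> str:
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_Name(self, node):
        if node.id != "len":                 # leave the builtin intact
            node.id = self._alias(node.id)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        for arg in node.args.args:
            arg.arg = self._alias(arg.arg)
        self.generic_visit(node)             # rename names in the body too
        return node

obfuscated = ast.unparse(Renamer().visit(ast.parse(SOURCE)))
print(obfuscated)  # same behavior, uninformative names: a harder probe for an LLM
```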
arXiv Detail & Related papers (2025-04-14T14:11:26Z)
- Post-Quantum Homomorphic Encryption: A Case for Code-Based Alternatives [0.6749750044497732]
Homomorphic Encryption (HE) allows secure and privacy-protected computation on encrypted data without the need to decrypt it. Most of the current PQHE algorithms are secured by lattice-based problems. Code-based encryption is a novel way to diversify post-quantum algorithms.
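The homomorphic property itself is easy to demonstrate with textbook (unpadded) RSA, which is multiplicatively homomorphic; this is only an illustration of computing on ciphertexts, not the code-based post-quantum constructions the paper studies.

```python
# Textbook (unpadded) RSA is multiplicatively homomorphic: multiplying two
# ciphertexts yields a ciphertext of the product of the plaintexts, so a party
# can compute on data it cannot read. Toy parameters, insecure by design.
n, e = 3233, 17     # n = 61 * 53; textbook public key
d = 2753            # private exponent: 17 * 2753 = 1 (mod 3120)

def enc(m: int) -> int:
    return pow(m, e, n)

def dec(c: int) -> int:
    return pow(c, d, n)

c1, c2 = enc(12), enc(7)
c_prod = (c1 * c2) % n          # computed without decrypting anything
assert dec(c_prod) == 12 * 7    # decrypts to the product of the plaintexts
print(dec(c_prod))              # 84
```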
arXiv Detail & Related papers (2025-03-28T06:49:22Z)
- Cryptanalysis via Machine Learning Based Information Theoretic Metrics [58.96805474751668]
We propose two novel applications of machine learning (ML) algorithms to perform cryptanalysis on any cryptosystem. These algorithms can be readily applied in an audit setting to evaluate the robustness of a cryptosystem. We show that our classification model correctly identifies the encryption schemes that are not IND-CPA secure, such as DES, RSA, and AES ECB, with high accuracy.
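A minimal demonstration of why a mode like AES-ECB fails IND-CPA, and hence why a learned distinguisher can flag it. This is not the paper's ML pipeline, and it assumes the third-party `cryptography` package is installed.

```python
# AES in ECB mode fails indistinguishability because identical plaintext blocks
# encrypt to identical ciphertext blocks -- a pattern a simple classifier
# (or the naked eye) can detect.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(16)
block = b"ATTACK AT DAWN!!"               # exactly 16 bytes
plaintext = block * 2                     # two identical blocks

encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
ciphertext = encryptor.update(plaintext) + encryptor.finalize()

# The repetition in the plaintext survives encryption -- a detectable statistic.
assert ciphertext[:16] == ciphertext[16:32]
print("ECB leaks block structure:", ciphertext[:16] == ciphertext[16:32])
```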
arXiv Detail & Related papers (2025-01-25T04:53:36Z)
- Decoding Secret Memorization in Code LLMs Through Token-Level Characterization [6.92858396995673]
Code Large Language Models (LLMs) have demonstrated remarkable capabilities in generating, understanding, and manipulating programming code. However, they can inadvertently memorize sensitive information, posing severe privacy risks. We present a novel approach to characterize real and fake secrets generated by Code LLMs based on token probabilities.
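A rough sketch of the token-probability idea, using GPT-2 purely as a stand-in model and made-up candidate secrets; the paper's actual features, models, and scoring differ.

```python
# Score a candidate secret by the per-token log-probabilities a language model
# assigns to it: memorized (real) strings tend to be scored as far more likely
# than random (fake) ones. Both example strings below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_logprob(text: str) -> float:
    """Average log-probability per token of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

print(mean_logprob('AWS_SECRET_KEY = "AKIAIOSFODNN7EXAMPLE"'))   # documented example key
print(mean_logprob('AWS_SECRET_KEY = "xq9Zr2Lw8vN3kTbY5mHd"'))   # random noise
```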
arXiv Detail & Related papers (2024-10-11T14:39:24Z)
- CodeCipher: Learning to Obfuscate Source Code Against LLMs [5.872773591957006]
We propose CodeCipher, a novel method that obfuscates private information in source code while preserving the original response from LLMs.
CodeCipher transforms the LLM's embedding matrix so that each row corresponds to a different word in the original matrix, forming a token-to-token confusion mapping for obfuscating source code.
Results show that our model successfully conceals private information in source code while preserving the original LLM's performance.
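A conceptual sketch of such a token-to-token confusion mapping, assuming a random permutation over a stand-in embedding matrix; the method described above learns its mapping rather than sampling it.

```python
# Token-to-token confusion via an embedding-row permutation: obfuscated token
# ids look scrambled to an observer, but looking them up in the permuted
# embedding matrix recovers the original vectors, so model behavior is preserved.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64
E = rng.standard_normal((vocab_size, dim))   # stand-in embedding matrix

perm = rng.permutation(vocab_size)           # row i of E_obf is row perm[i] of E
E_obf = E[perm]
inv = np.argsort(perm)                       # inverse mapping: perm[inv[t]] == t

token_ids = np.array([12, 407, 12, 93])      # toy "source code" token sequence
obf_ids = inv[token_ids]                     # what an outside observer sees

# The permuted matrix maps obfuscated ids back to the original embeddings.
assert np.allclose(E_obf[obf_ids], E[token_ids])
print(token_ids, "->", obf_ids)
```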
arXiv Detail & Related papers (2024-10-08T08:28:54Z)
- A Machine Learning-Based Framework for Assessing Cryptographic Indistinguishability of Lightweight Block Ciphers [1.5953412143328967]
Indistinguishability is a fundamental principle of cryptographic security, crucial for securing data transmitted between Internet of Things (IoT) devices.
This research investigates the ability of machine learning (ML) to assess the indistinguishability property of encryption systems.
We introduce MIND-Crypt, a novel ML-based framework designed to assess the cryptographic indistinguishability of lightweight block ciphers.
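A schematic version of an ML indistinguishability test, with a one-time-pad stand-in cipher in place of a lightweight block cipher; MIND-Crypt's actual models and target ciphers differ.

```python
# Fix two messages, collect ciphertexts of each under fresh randomness, and
# train a classifier to tell them apart. Held-out accuracy near 50% is evidence
# of indistinguishability; accuracy well above 50% exposes a distinguisher.
import os
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def toy_encrypt(message: bytes) -> np.ndarray:
    """Stand-in cipher: XOR with a fresh random one-time pad (perfectly secret)."""
    pad = os.urandom(len(message))
    return np.frombuffer(bytes(m ^ p for m, p in zip(message, pad)), dtype=np.uint8)

m0, m1 = b"YES" * 8, b"NO!" * 8
X = np.stack([toy_encrypt(m0) for _ in range(500)] +
             [toy_encrypt(m1) for _ in range(500)]).astype(float)
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("distinguisher accuracy:", clf.score(X_te, y_te))  # ~0.5 for this cipher
```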
arXiv Detail & Related papers (2024-05-30T04:40:13Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models [49.60006012946767]
We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics.
We conduct extensive experiments on 7 Large Language Models, achieving a state-of-the-art average Attack Success Rate (ASR).
Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
arXiv Detail & Related papers (2024-02-26T16:35:59Z)
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [85.18213923151717]
Experimental results show that certain ciphers bypass the safety alignment of GPT-4 almost 100% of the time in several safety domains.
We propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability.
arXiv Detail & Related papers (2023-08-12T04:05:57Z)