Related papers: Reliable Detection of Compressed and Encrypted Data

URL: http://arxiv.org/abs/2103.17059v1
Date: Wed, 31 Mar 2021 13:27:28 GMT
Title: Reliable Detection of Compressed and Encrypted Data
Authors: Fabio De Gaspari, Dorjan Hitaj, Giulio Pagnotta, Lorenzo De Carli, Luigi V. Mancini
Abstract summary: Ransomware detection, forensics and data analysis require methods to reliably identify encrypted data fragments. Current approaches employ statistics derived from byte-level distribution, such as entropy estimation, to identify encrypted fragments. Modern content types use compression techniques which alter data distribution, pushing it closer to the uniform distribution. This paper compares existing statistical tests on a large, standardized dataset and shows that current approaches consistently fail to distinguish encrypted and compressed data.
Score: 1.3439502310822147
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Several cybersecurity domains, such as ransomware detection, forensics and data analysis, require methods to reliably identify encrypted data fragments. Typically, current approaches employ statistics derived from byte-level distribution, such as entropy estimation, to identify encrypted fragments. However, modern content types use compression techniques which alter data distribution, pushing it closer to the uniform distribution. The result is that current approaches exhibit unreliable encryption detection performance when compressed data appears in the dataset. Furthermore, proposed approaches are typically evaluated over few data types and fragment sizes, making it hard to assess their practical applicability. This paper compares existing statistical tests on a large, standardized dataset and shows that current approaches consistently fail to distinguish encrypted and compressed data on both small and large fragment sizes. We address these shortcomings and design EnCoD, a learning-based classifier which can reliably distinguish compressed and encrypted data. We evaluate EnCoD on a dataset of 16 different file types and fragment sizes ranging from 512B to 8KB. Our results highlight that EnCoD outperforms current approaches by a wide margin, with accuracy ranging from ~82% for 512B fragments up to ~92% for 8KB data fragments. Moreover, EnCoD can pinpoint the exact format of a given data fragment, rather than performing only binary classification like previous approaches.
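
The byte-level entropy estimate mentioned in the abstract can be illustrated with a short sketch. This is not the paper's code or EnCoD itself; the function name `byte_entropy`, the synthetic plaintext, the use of zlib for compression, and `os.urandom` as a stand-in for ciphertext are all assumptions made for illustration.

```python
import math
import os
import zlib
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative stand-ins (assumptions, not the paper's dataset):
text = b" ".join(str(i).encode() for i in range(20000))  # structured plaintext
compressed = zlib.compress(text, 9)                       # DEFLATE-compressed data
random_like = os.urandom(8192)                            # proxy for encrypted data

for name, data in [("plaintext", text),
                   ("compressed", compressed),
                   ("random", random_like)]:
    fragment = data[:8192]  # 8KB is the largest fragment size in the abstract
    print(f"{name:>10}: {byte_entropy(fragment):.3f} bits/byte")
```

On inputs like these, the compressed and random fragments would typically both score close to the 8 bits/byte maximum while the plaintext scores far lower, which suggests why a single entropy threshold struggles to separate compression from encryption.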

Related papers

A Unified Evaluation of Learning-Based Similarity Techniques for Malware Detection [0.0]
Similarity-based techniques enable approximate matching, allowing related byte sequences to produce measurably similar fingerprints. Security researchers have proposed a range of approaches, including similarity digests and locality-sensitive hashes. This paper presents a systematic comparison of learning-based classification and similarity methods using large, publicly available datasets.
arXiv Detail & Related papers (2026-02-17T06:16:23Z)
Plaintext Structure Vulnerability: Robust Cipher Identification via a Distributional Randomness Fingerprint Feature Extractor [23.713094083283334]
We present a method that does not learn end-to-end from ciphertext bytes. Specifically, this method is based on a set of statistical tests to compute the randomness feature of the ciphertext. The experimental results demonstrate that our method achieves high discriminative performance.
arXiv Detail & Related papers (2025-11-11T14:29:42Z)
Transformers from Compressed Representations [74.48571451824569]
TEMPEST (TransformErs froM comPressed rEpreSenTations) is a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage.
arXiv Detail & Related papers (2025-10-26T13:48:03Z)
Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking [51.74368870268278]
We propose TRACE, a framework for fully black-box detection of copyrighted dataset usage in large language models. TRACE rewrites datasets with distortion-free watermarks guided by a private key. Across diverse datasets and model families, TRACE consistently achieves significant detections.
arXiv Detail & Related papers (2025-10-03T12:53:02Z)
Learning to Localize Leakage of Cryptographic Sensitive Variables [13.98875599619791]
We develop a principled deep learning framework for determining the relative leakage due to measurements recorded at different points in time. This information is invaluable to cryptographic hardware designers for understanding *why* their hardware leaks.
arXiv Detail & Related papers (2025-03-10T15:42:30Z)
Hidden Data Privacy Breaches in Federated Learning [24.47236055167954]
Federated Learning (FL) emerged as a paradigm for conducting machine learning across broad and decentralized datasets. Recent studies show that attackers can steal private data through model manipulation or gradient analysis. We propose a novel data-reconstruction attack leveraging malicious code injection, supported by two key techniques.
arXiv Detail & Related papers (2024-11-27T12:04:37Z)
ODDN: Addressing Unpaired Data Challenges in Open-World Deepfake Detection on Online Social Networks [51.03118447290247]
We propose the open-world deepfake detection network (ODDN), which comprises open-world data aggregation (ODA) and compression-discard gradient correction (CGC). ODA effectively aggregates correlations between compressed and raw samples through both fine-grained and coarse-grained analyses. CGC incorporates a compression-discard gradient correction to further enhance performance across diverse compression methods in online social networks (OSNs).
arXiv Detail & Related papers (2024-10-24T12:32:22Z)
DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
arXiv Detail & Related papers (2024-06-05T01:19:44Z)
Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain. We propose an adversarial algorithm to make the retriever component robust against distribution shift. We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
CrossDF: Improving Cross-Domain Deepfake Detection with Deep Information Decomposition [53.860796916196634]
We propose a Deep Information Decomposition (DID) framework to enhance the performance of Cross-dataset Deepfake Detection (CrossDF). Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over specific visual artifacts. It adaptively decomposes facial features into deepfake-related and irrelevant information, only using the intrinsic deepfake-related information for real/fake discrimination.
arXiv Detail & Related papers (2023-09-30T12:30:25Z)
Anti-Compression Contrastive Facial Forgery Detection [38.69677442287986]
We propose an anti-compression forgery detection framework by maintaining closer relations within data under different compression levels. Experimental results show that the proposed algorithm could boost performance on strongly compressed data while improving the accuracy rate when detecting clean data.
arXiv Detail & Related papers (2023-02-13T08:34:28Z)
Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset. Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes. We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
Using Convolutional Neural Networks to Detect Compression Algorithms [0.0]
We used a base dataset, compressed every file with various algorithms, and designed a model based on that. The model was able to accurately identify files compressed using compress, lzip and bzip2.
arXiv Detail & Related papers (2021-11-17T11:03:16Z)
MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos. We present a novel approach, termed MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
Malware Traffic Classification: Evaluation of Algorithms and an Automated Ground-truth Generation Pipeline [8.779666771357029]
We propose an automated packet data-labeling pipeline to generate ground-truth data. We explore and test different kinds of clustering approaches which make use of a unique and diverse set of features extracted from this observable meta-data.
arXiv Detail & Related papers (2020-10-22T11:48:51Z)
EnCoD: Distinguishing Compressed and Encrypted File Fragments [0.9239657838690228]
We show that current approaches cannot reliably tell apart encryption and compression, even for large fragment sizes. We design EnCoD, a learning-based classifier which can reliably distinguish compressed and encrypted data, starting with fragments as small as 512 bytes. We evaluate EnCoD against current approaches over a large dataset of different data types, showing that it outperforms current state-of-the-art for most considered fragment sizes and data types.
arXiv Detail & Related papers (2020-10-15T13:55:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.