A Byte Sequence is Worth an Image: CNN for File Fragment Classification
Using Bit Shift and n-Gram Embeddings
- URL: http://arxiv.org/abs/2304.06983v1
- Date: Fri, 14 Apr 2023 08:06:52 GMT
- Title: A Byte Sequence is Worth an Image: CNN for File Fragment Classification
Using Bit Shift and n-Gram Embeddings
- Authors: Wenyang Liu, Yi Wang, Kejun Wu, Kim-Hui Yap and Lap-Pui Chau
- Abstract summary: File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security.
Existing methods mainly treat file fragments as 1d byte signals and utilize the captured inter-byte features for classification.
We propose Byte2Image, a novel data augmentation technique, to introduce the neglected intra-byte information into file fragments and re-treat them as 2d gray-scale images.
- Score: 21.14735408046021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: File fragment classification (FFC) on small chunks of memory is essential in
memory forensics and Internet security. Existing methods mainly treat file
fragments as 1d byte signals and utilize the captured inter-byte features for
classification, while the bit information within bytes, i.e., intra-byte
information, is seldom considered. This is inherently ill-suited to classifying
variable-length coding files, whose symbols are represented by a variable
number of bits. To address this, we propose Byte2Image, a novel data augmentation
technique, to introduce the neglected intra-byte information into file
fragments and re-treat them as 2d gray-scale images, which allows us to capture
both inter-byte and intra-byte correlations simultaneously through powerful
convolutional neural networks (CNNs). Specifically, to convert file fragments
to 2d images, we employ a sliding byte window to expose the neglected
intra-byte information and stack their n-gram features row by row. We further
propose a byte sequence & image fusion network as a classifier, which can
jointly model the raw 1d byte sequence and the converted 2d image to perform
FFC. Experiments on the FFT-75 dataset validate that our proposed method can
achieve notable accuracy improvements over state-of-the-art methods in nearly
all scenarios. The code will be released at
https://github.com/wenyang001/Byte2Image.
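As a concrete illustration of the conversion described above, here is a minimal NumPy sketch of the Byte2Image idea: eight bit-shifted views of the fragment expose the intra-byte information, and sliding n-byte windows over those views are stacked row by row into a 2D gray-scale image. The window size and exact row layout are illustrative assumptions; the released code at the repository above defines the actual mapping.

```python
import numpy as np

def byte2image(fragment: bytes, n: int = 16) -> np.ndarray:
    """Illustrative Byte2Image sketch; n and the row layout are assumptions."""
    bits = np.unpackbits(np.frombuffer(fragment, dtype=np.uint8))
    usable = (len(bits) - 8) // 8 * 8
    # Eight bit-shifted views of the stream: view s re-reads the bit string
    # from offset s, so symbols that straddle byte boundaries (as in
    # variable-length coding) show up as ordinary byte values.
    views = np.stack([np.packbits(bits[s:s + usable]) for s in range(8)])
    # Slide an n-byte window across the views; each window becomes one row.
    height = views.shape[1] - n + 1
    rows = [views[:, i:i + n].reshape(-1) for i in range(height)]
    return np.stack(rows)  # (height, 8 * n) uint8 gray-scale image

img = byte2image(bytes(range(512)), n=16)
print(img.shape)  # (496, 128)
```

A CNN can then consume this image together with the raw byte sequence; a sketch of that two-branch fusion classifier follows the ByteNet entry below.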
Related papers
- ByteNet: Rethinking Multimedia File Fragment Classification through Visual Perspectives [23.580848165023962]
Multimedia file fragment classification (MFFC) aims to identify file fragment types without system metadata.
Existing MFFC methods treat fragments as 1D byte sequences and emphasize relations between separate bytes (inter-byte relations) for classification.
Byte2Image incorporates previously overlooked intra-byte information into file fragments and reinterprets these fragments as 2D images.
ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) and a deep image branch feature extraction (IBFE) network.
arXiv Detail & Related papers (2024-10-28T09:19:28Z)
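Both the byte sequence & image fusion network above and ByteNet's BBFE/IBFE pairing follow the same two-branch pattern: a shallow branch over the raw 1D bytes and a deeper CNN over the converted 2D image, fused before the classifier head. Below is a minimal PyTorch sketch of that pattern, with illustrative layer sizes (and 75 classes, matching FFT-75) rather than either paper's exact architecture.

```python
import torch
import torch.nn as nn

class ByteImageFusion(nn.Module):
    """Illustrative two-branch fusion classifier; all sizes are assumptions."""
    def __init__(self, num_classes: int = 75):
        super().__init__()
        self.embed = nn.Embedding(256, 32)            # one vector per byte value
        self.byte_conv = nn.Sequential(               # shallow 1D byte branch
            nn.Conv1d(32, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.image_conv = nn.Sequential(              # deeper 2D image branch
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64 + 64, num_classes)

    def forward(self, byte_seq: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # byte_seq: (B, L) int64 in [0, 255]; image: (B, 1, H, W) float
        b = self.embed(byte_seq).transpose(1, 2)      # (B, 32, L)
        b = self.byte_conv(b).squeeze(-1)             # (B, 64)
        i = self.image_conv(image).flatten(1)         # (B, 64)
        return self.head(torch.cat([b, i], dim=1))    # (B, num_classes)
```

Global average pooling in both branches keeps the head independent of fragment length and image size.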
- GlobalMamba: Global Image Serialization for Vision Mamba [73.50475621164037]
Vision Mambas have demonstrated strong performance with linear complexity in the number of vision tokens.
Most existing methods employ patch-based image tokenization and then flatten the tokens into 1D sequences for causal processing.
We propose a global image serialization method to transform the image into a sequence of causal tokens.
arXiv Detail & Related papers (2024-10-14T09:19:05Z)
- Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks [2.9835839258066015]
We introduce a memory-efficient CNN (convolutional neural network) for on-device vision tasks.
The proposed network classifies ImageNet with extremely low memory (63 KB) while achieving competitive top-1 accuracy (61.58%).
To the best of our knowledge, the memory usage of the proposed network is far smaller than that of state-of-the-art memory-efficient networks.
arXiv Detail & Related papers (2024-08-07T10:04:04Z)
- UniGS: Unified Representation for Image Generation and Segmentation [105.08152635402858]
We use a colormap to represent entity-level masks, addressing the challenge of a varying number of entities.
Two novel modules, a location-aware color palette and a progressive dichotomy module, are proposed to support our mask representation.
arXiv Detail & Related papers (2023-12-04T15:59:27Z)
- Bytes Are All You Need: Transformers Operating Directly On File Bytes [55.81123238702553]
We investigate modality-independent representation learning by performing classification on file bytes, without the need for decoding files at inference time.
Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by 5%.
We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing.
arXiv Detail & Related papers (2023-05-31T23:18:21Z)
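For a rough sense of the ByteFormer approach (classification straight from file bytes with a transformer, no decoding), here is a toy sketch; the dimensions, depth, and pooling are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyByteClassifier(nn.Module):
    """Toy byte-level transformer classifier; all sizes are illustrative."""
    def __init__(self, num_classes: int, dim: int = 128, max_len: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(256, dim)          # one token per byte value
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, byte_seq: torch.Tensor) -> torch.Tensor:
        # byte_seq: (B, L) int64 in [0, 255]; no file decoding required
        x = self.embed(byte_seq) + self.pos[:, :byte_seq.size(1)]
        return self.head(self.encoder(x).mean(dim=1))  # mean-pool, classify
```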
- Transform and Bitstream Domain Image Classification [2.4366811507669124]
This paper proposes two proof-of-concept methods for classifying JPEG images without full decoding.
The first classifies within the JPEG transform domain (i.e., DCT data); the second classifies the JPEG compressed binary bitstream directly.
Top-1 accuracies of approximately 70% and 60% were achieved on the Caltech-101 dataset.
arXiv Detail & Related papers (2021-10-13T14:18:58Z)
- byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings [77.6701264226519]
We introduce byteSteady, a fast model for classification using byte-level n-gram embeddings.
A straightforward application of byteSteady is text classification.
We also apply byteSteady to one type of non-language data: DNA sequences for gene classification.
arXiv Detail & Related papers (2021-06-24T20:14:48Z)
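A hedged sketch of the byteSteady recipe: hash byte-level n-grams into a fixed embedding vocabulary, average their embeddings, and apply a linear classifier. The hashing scheme and sizes below are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class ByteNGramClassifier(nn.Module):
    """byteSteady-style sketch: hashed byte n-gram embeddings, mean-pooled."""
    def __init__(self, num_classes: int, buckets: int = 1 << 18, dim: int = 64):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.EmbeddingBag(buckets, dim, mode="mean")  # averages embeddings
        self.head = nn.Linear(dim, num_classes)

    def ngram_ids(self, data: bytes, n: int = 2) -> torch.Tensor:
        # Hash every byte-level n-gram into the bucket range (assumed scheme;
        # Python's hash() is salted per process, so a real model would fix a hash).
        grams = [data[i:i + n] for i in range(len(data) - n + 1)]
        return torch.tensor([hash(g) % self.buckets for g in grams])

    def forward(self, data: bytes) -> torch.Tensor:
        return self.head(self.emb(self.ngram_ids(data).unsqueeze(0)))  # (1, C)
```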
- Memory-guided Unsupervised Image-to-image Translation [54.1903150849536]
We present an unsupervised framework for instance-level image-to-image translation.
We show that our model outperforms recent instance-level methods.
arXiv Detail & Related papers (2021-04-12T03:02:51Z)
- Two-stage generative adversarial networks for document image binarization with color noise and background removal [7.639067237772286]
We propose a two-stage color document image enhancement and binarization method using generative adversarial neural networks.
In the first stage, four color-independent adversarial networks are trained to extract color foreground information from an input image.
In the second stage, two independent adversarial networks with global and local features are trained for image binarization of documents of variable size.
arXiv Detail & Related papers (2020-10-20T07:51:50Z)
- FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning [64.32306537419498]
We propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations.
These transformations also use information from both within-class and across-class representations that we extract through clustering.
We demonstrate that our method is comparable to the current state of the art on smaller datasets while being able to scale up to larger datasets.
arXiv Detail & Related papers (2020-07-16T17:55:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.