Bytes Are All You Need: Transformers Operating Directly On File Bytes
- URL: http://arxiv.org/abs/2306.00238v2
- Date: Mon, 1 Jul 2024 15:54:17 GMT
- Title: Bytes Are All You Need: Transformers Operating Directly On File Bytes
- Authors: Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
- Abstract summary: We investigate modality-independent representation learning by performing classification on file bytes, without the need for decoding files at inference time.
Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by $5\%$ relative to DeiT models of similar size.
We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing.
- Score: 55.81123238702553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern deep learning approaches usually utilize modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor which is passed into a neural network. Instead, we investigate modality-independent representation learning by performing classification directly on file bytes, without the need for decoding files at inference time. This enables models to operate on various modalities without any hand-designed, modality-specific processing. Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by $5\%$ (from $72.2\%$ to $77.33\%$) relative to DeiT models of similar size. Compared to Perceiver IO, our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing. We achieve $95.42\%$ classification accuracy on the Speech Commands V2 dataset (comparable to the state-of-the-art accuracy of $98.7\%$). Additionally, we demonstrate that ByteFormer can operate jointly on images and audio, handling joint classification without explicit knowledge of the input modality. We release our code at https://github.com/apple/corenet/tree/main/projects/byteformer.
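The abstract pins down the core recipe well enough to sketch: embed each of the 256 possible byte values and run a Transformer encoder directly over a file's bytes, with no decoding step. The PyTorch sketch below is a minimal illustration under that reading, not the released ByteFormer (see the repository linked above for the real architecture); the plain encoder and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of byte-level classification in the spirit of ByteFormer.
# NOT the released architecture (see the linked repo); the plain encoder
# and all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    def __init__(self, num_classes: int, max_len: int = 4096,
                 dim: int = 192, depth: int = 6, heads: int = 3):
        super().__init__()
        self.byte_embed = nn.Embedding(256, dim)  # one token per byte value
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) integers in [0, 255], read straight
        # from a file (JPEG, WAV, ...) with no modality-specific decoding.
        x = self.byte_embed(byte_ids) + self.pos_embed[:, : byte_ids.size(1)]
        return self.head(self.encoder(x).mean(dim=1))  # mean-pool, classify

ids = torch.randint(0, 256, (1, 4096))  # stand-in for raw file bytes
logits = ByteClassifier(num_classes=1000)(ids)
```

The interesting part is the input interface rather than the encoder: because the model only ever sees byte IDs, the same network can consume image files and audio files alike, which is what enables the joint image/audio classification reported above.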
Related papers
- LiteNeXt: A Novel Lightweight ConvMixer-based Model with Self-embedding Representation Parallel for Medical Image Segmentation [2.0901574458380403]
We propose LiteNeXt, a new lightweight yet efficient model for medical image segmentation.
LiteNeXt is trained from scratch with a small parameter count (0.71M) and low computational cost (0.42 GFLOPs).
arXiv Detail & Related papers (2024-04-04T01:59:19Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate Vision Transformer networks' heavy demand for very large fully-annotated datasets.
We propose MOCA, a single-stage and standalone method that unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - A Byte Sequence is Worth an Image: CNN for File Fragment Classification Using Bit Shift and n-Gram Embeddings [21.14735408046021]
File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security.
Existing methods mainly treat file fragments as 1D byte signals and use the captured inter-byte features for classification.
We propose Byte2Image, a novel data augmentation technique that introduces the neglected intra-byte information into file fragments and reinterprets them as 2D grayscale images (a hedged sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-04-14T08:06:52Z) - Exploring the Limits of Deep Image Clustering using Pretrained Models [1.1060425537315088]
We present a methodology that learns to classify images without labels by leveraging pretrained feature extractors.
We propose a novel objective that learns associations between image features by introducing a variant of pointwise mutual information together with instance weighting.
arXiv Detail & Related papers (2023-03-31T08:56:29Z) - BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z) - Matching Feature Sets for Few-Shot Image Classification [22.84472344406448]
We argue that a set-based representation intrinsically builds a richer representation of images from the base classes.
Our approach, dubbed SetFeat, embeds shallow self-attention mechanisms inside existing encoder architectures.
arXiv Detail & Related papers (2022-04-02T22:42:54Z) - CNNs for JPEGs: A Study in Computational Cost [49.97673761305336]
Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade.
CNNs learn robust representations of the data directly from RGB pixels.
Deep learning methods that instead learn directly from the compressed (e.g., JPEG) domain have been gaining attention in recent years.
arXiv Detail & Related papers (2020-12-26T15:00:10Z) - Shape-Texture Debiased Neural Network Training [50.6178024087048]
Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset.
We develop an algorithm for shape-texture debiased learning.
Experiments show that our method successfully improves model performance on several image recognition benchmarks.
arXiv Detail & Related papers (2020-10-12T19:16:12Z) - Rethinking CNN Models for Audio Classification [20.182928938110923]
ImageNet-pretrained standard deep CNN models can be used as strong baseline networks for audio classification.
We systematically study how much of the pretrained weights are useful for learning spectrograms.
We show that, for a given standard model, using pretrained weights is better than using randomly initialized weights (a transfer-learning sketch appears after this list).
arXiv Detail & Related papers (2020-07-22T01:31:44Z) - Learning to Learn Parameterized Classification Networks for Scalable Input Images [76.44375136492827]
Convolutional Neural Networks (CNNs) do not exhibit predictable recognition behavior with respect to changes in input resolution.
We employ meta learners to generate the convolutional weights of the main networks for various input scales.
We further apply on-the-fly knowledge distillation across model predictions made at different input resolutions.
arXiv Detail & Related papers (2020-07-13T04:27:25Z)
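Two of the summaries above describe mechanisms concrete enough to sketch. First, the bit-shift step behind a Byte2Image-style transform (from "A Byte Sequence is Worth an Image"): a hypothetical NumPy sketch, not the authors' code, and it omits the paper's n-gram embedding stage.

```python
# Hypothetical sketch of a Byte2Image-style bit-shift step: stack
# bit-shifted views of a fragment into a 2D grayscale array so that
# intra-byte structure becomes visible to a 2D CNN.
import os
import numpy as np

def byte2image(fragment: bytes) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(fragment, dtype=np.uint8))
    usable = (bits.size - 7) // 8 * 8   # longest span valid for every shift
    rows = [np.packbits(bits[s:s + usable].reshape(-1, 8), axis=1).ravel()
            for s in range(8)]          # shift offsets of 0..7 bits
    return np.stack(rows)               # (8, usable // 8) uint8 "image"

img = byte2image(os.urandom(512))       # stand-in for a real file fragment
print(img.shape)                        # (8, 511)
```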
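Second, the transfer-learning recipe from "Rethinking CNN Models for Audio Classification": reuse an ImageNet-pretrained CNN on spectrograms. A hedged torchvision sketch; the single-channel stem replacement and the input shape are assumptions, not the paper's exact setup.

```python
# Hedged sketch: adapt an ImageNet-pretrained ResNet to classify
# spectrograms. Stem replacement and shapes are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

def spectrogram_classifier(num_classes: int) -> nn.Module:
    net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Spectrograms are single-channel: swap the 3-channel stem (this layer
    # restarts from random init; alternatively, tile the channel 3 times).
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)  # fresh task head
    return net

model = spectrogram_classifier(num_classes=35)  # Speech Commands V2: 35 classes
logits = model(torch.randn(2, 1, 128, 400))     # (batch, 1, n_mels, time)
```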