Related papers: Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

URL: http://arxiv.org/abs/2411.10281v1
Date: Fri, 15 Nov 2024 15:36:48 GMT
Title: Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
Authors: Tim Elsner, Paula Usinger, Julius Nehring-Wirxel, Gregor Kobsik, Victor Czech, Yanjiang He, Isaak Lim, Leif Kobbelt,
Abstract summary: In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters. Our work improves tokenisation of visual data by bringing Byte Pair. from 1D to multiple dimensions.
Score: 7.659816122873334
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters. This is often done with Byte Pair Encoding. In the context of images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, without global content awareness. Our work improves tokenisation of visual data by bringing Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this through counting constellations of token pairs and replacing the most frequent token pair with a newly introduced token. The multidimensionality only increases the computation time by a factor of 2 for images, making it applicable even to large datasets like ImageNet within minutes on consumer hardware. This is a lossless preprocessing step. Our evaluation shows improved training and inference performance of transformers on visual data achieved by compressing frequent constellations of tokens: The resulting sequences are shorter, with more uniformly distributed information content, e.g. condensing empty regions in an image into single tokens. As our experiments show, these condensed sequences are easier to process. We additionally introduce a strategy to amplify this compression further by clustering the vocabulary.

Related papers

CAT: Content-Adaptive Image Tokenization [92.2116487267877]
We introduce Content-Adaptive Tokenizer (CAT), which adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same flops and boosts the inference throughput by 18.5%.
arXiv Detail & Related papers (2025-01-06T16:28:47Z)
Spectral Image Tokenizer [21.84385276311364]
Image tokenizers map images to sequences of discrete tokens. We propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT) We evaluate the tokenizer metrics as multiscale image generation, text-guided image upsampling and editing.
arXiv Detail & Related papers (2024-12-12T18:59:31Z)
Adaptive Length Image Tokenization via Recurrent Allocation [81.10081670396956]
Current vision systems assign fixed-length representations to images, regardless of the information content. Inspired by this, we propose an approach to learn variable-length token representations for 2D images.
arXiv Detail & Related papers (2024-11-04T18:58:01Z)
GlobalMamba: Global Image Serialization for Vision Mamba [73.50475621164037]
Vision mambas have demonstrated strong performance with linear complexity to the number of vision tokens. Most existing methods employ patch-based image tokenization and then flatten them into 1D sequences for causal processing. We propose a global image serialization method to transform the image into a sequence of causal tokens.
arXiv Detail & Related papers (2024-10-14T09:19:05Z)
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their different informativeness and leads to a significant increase in the number of image tokens. We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
Transformer based Pluralistic Image Completion with Reduced Information Loss [72.92754600354199]
Transformer based methods have achieved great success in image inpainting recently. They regard each pixel as a token, thus suffering from an information loss issue. We propose a new transformer based framework called "PUT"
arXiv Detail & Related papers (2024-03-31T01:20:16Z)
Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting [8.572133295533643]
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes. Our method learns latent priors, discretized as tokens, by only performing computations at the visible locations of the image.
arXiv Detail & Related papers (2024-03-27T01:28:36Z)
Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. We show improved performance over competing approaches over both image-level tasks relying on coarse-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
Perceptual Image Compression with Cooperative Cross-Modal Side Information [53.356714177243745]
We propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff. Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features.
arXiv Detail & Related papers (2023-11-23T08:31:11Z)
Byte Pair Encoding for Symbolic Music [0.0]
Byte Pair embeddings significantly decreases the sequence length while increasing the vocabulary size. We leverage the embedding capabilities of such models with more expressive tokens, resulting in both better results and faster inference in generation and classification tasks. The source code is shared on Github, along with a companion website.
arXiv Detail & Related papers (2023-01-27T20:22:18Z)
Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.