Image and Video Tokenization with Binary Spherical Quantization
- URL: http://arxiv.org/abs/2406.07548v1
- Date: Tue, 11 Jun 2024 17:59:53 GMT
- Title: Image and Video Tokenization with Binary Spherical Quantization
- Authors: Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl,
- Abstract summary: We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ)
BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization.
Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input.
- Score: 36.850958591333836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100$\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4$\times$ throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.
Related papers
- DeepFGS: Fine-Grained Scalable Coding for Learned Image Compression [27.834491128701963]
This paper proposes a learned fine-grained scalable image compression framework, namely DeepFGS.
For entropy coding, we design a mutual entropy model to fully explore the correlation between the basic and scalable features.
Experiments demonstrate that our proposed DeepFGS outperforms previous learning-based scalable image compression models.
arXiv Detail & Related papers (2024-11-30T11:19:38Z) - High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression-DHVC 2.0- introduces superior compression performance and impressive complexity efficiency.
Uses hierarchical predictive coding to transform each video frame into multiscale representations.
Supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
arXiv Detail & Related papers (2024-10-03T15:40:58Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - CAMSIC: Content-aware Masked Image Modeling Transformer for Stereo Image Compression [15.819672238043786]
We propose a stereo image compression framework, named CAMSIC.
CAMSIC transforms each image to latent representation and employs a powerful decoder-free Transformer entropy model.
Experiments show that our framework achieves state-of-the-art rate-distortion performance.
arXiv Detail & Related papers (2024-03-13T13:12:57Z) - Video Coding Using Learned Latent GAN Compression [1.6058099298620423]
We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video.
Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned.
arXiv Detail & Related papers (2022-07-09T19:07:43Z) - Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural networks, which can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation.
We tested our approach across different tasks: dimensionality reduction using three different dataset, image compression using the MNIST dataset, and image denoising using fashion MNIST.
arXiv Detail & Related papers (2022-02-09T18:48:02Z) - A New Image Codec Paradigm for Human and Machine Uses [53.48873918537017]
A new scalable image paradigm for both human and machine uses is proposed in this work.
The high-level instance segmentation map and the low-level signal features are extracted with neural networks.
An image is designed and trained to achieve the general-quality image reconstruction with the 16-bit gray-scale profile and signal features.
arXiv Detail & Related papers (2021-12-19T06:17:38Z) - Transformer-based Image Compression [18.976159633970177]
Transformer-based Image Compression (TIC) approach is developed which reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders.
TIC rivals with state-of-the-art approaches including deep convolutional neural networks (CNNs) based learnt image coding (LIC) methods and handcrafted rules-based intra profile of recently-approved Versatile Video Coding (VVC) standard.
arXiv Detail & Related papers (2021-11-12T13:13:20Z) - Modeling Lost Information in Lossy Image Compression [72.69327382643549]
Lossy image compression is one of the most commonly used operators for digital images.
We propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem.
arXiv Detail & Related papers (2020-06-22T04:04:56Z) - An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond
Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding.
We employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion pattern.
By learning to extract sparse motion pattern via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.