Related papers: Image and Video Tokenization with Binary Spherical Quantization

Image and Video Tokenization with Binary Spherical Quantization

URL: http://arxiv.org/abs/2406.07548v1
Date: Tue, 11 Jun 2024 17:59:53 GMT
Title: Image and Video Tokenization with Binary Spherical Quantization
Authors: Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl,
Abstract summary: We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ) BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input.
Score: 36.850958591333836
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100$\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4$\times$ throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.

Related papers

DVD-Quant: Data-free Video Diffusion Transformers Quantization [98.43940510241768]
Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment.<n>We propose DVD-Quant, a novel Data-free quantization framework for Video DiTs.<n>Our approach integrates three key innovations: Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction.
arXiv Detail & Related papers (2025-05-24T11:56:02Z)
Generative Latent Coding for Ultra-Low Bitrate Image and Video Compression [61.500904231491596]
Most approaches for image and video compression perform transform coding in the pixel space to reduce redundancy.<n>We propose textbfGenerative textbfLatent textbfCoding (textbfGLC) models for image and video compression, GLC-image and GLC-Video.
arXiv Detail & Related papers (2025-05-22T03:31:33Z)
HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
DeepFGS: Fine-Grained Scalable Coding for Learned Image Compression [27.834491128701963]
This paper proposes a learned fine-grained scalable image compression framework, namely DeepFGS. For entropy coding, we design a mutual entropy model to fully explore the correlation between the basic and scalable features. Experiments demonstrate that our proposed DeepFGS outperforms previous learning-based scalable image compression models.
arXiv Detail & Related papers (2024-11-30T11:19:38Z)
High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression-DHVC 2.0- introduces superior compression performance and impressive complexity efficiency. Uses hierarchical predictive coding to transform each video frame into multiscale representations. Supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
arXiv Detail & Related papers (2024-10-03T15:40:58Z)
When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
Content-aware Masked Image Modeling Transformer for Stereo Image Compression [15.819672238043786]
We propose a stereo image compression framework, named CAMSIC. CAMSIC transforms each image to latent representation and employs a powerful decoder-free Transformer entropy model. Experiments show that our framework achieves state-of-the-art rate-distortion performance on two stereo image datasets.
arXiv Detail & Related papers (2024-03-13T13:12:57Z)
Video Coding Using Learned Latent GAN Compression [1.6058099298620423]
We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video. Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned.
arXiv Detail & Related papers (2022-07-09T19:07:43Z)
Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural networks, which can be used to solve various tasks. We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation. We tested our approach across different tasks: dimensionality reduction using three different dataset, image compression using the MNIST dataset, and image denoising using fashion MNIST.
arXiv Detail & Related papers (2022-02-09T18:48:02Z)
A New Image Codec Paradigm for Human and Machine Uses [53.48873918537017]
A new scalable image paradigm for both human and machine uses is proposed in this work. The high-level instance segmentation map and the low-level signal features are extracted with neural networks. An image is designed and trained to achieve the general-quality image reconstruction with the 16-bit gray-scale profile and signal features.
arXiv Detail & Related papers (2021-12-19T06:17:38Z)
Transformer-based Image Compression [18.976159633970177]
Transformer-based Image Compression (TIC) approach is developed which reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders. TIC rivals with state-of-the-art approaches including deep convolutional neural networks (CNNs) based learnt image coding (LIC) methods and handcrafted rules-based intra profile of recently-approved Versatile Video Coding (VVC) standard.
arXiv Detail & Related papers (2021-11-12T13:13:20Z)
Learned Multi-Resolution Variable-Rate Image Compression with Octave-based Residual Blocks [15.308823742699039]
We propose a new variable-rate image compression framework, which employs generalized octave convolutions (GoConv) and generalized octave transposed-convolutions (GoTConv) To enable a single model to operate with different bit rates and to learn multi-rate image features, a new objective function is introduced. Experimental results show that the proposed framework trained with variable-rate objective function outperforms the standard codecs such as H.265/HEVC-based BPG and state-of-the-art learning-based variable-rate methods.
arXiv Detail & Related papers (2020-12-31T06:26:56Z)
Modeling Lost Information in Lossy Image Compression [72.69327382643549]
Lossy image compression is one of the most commonly used operators for digital images. We propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem.
arXiv Detail & Related papers (2020-06-22T04:04:56Z)
An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding. We employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion pattern. By learning to extract sparse motion pattern via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.