MambaVideo for Discrete Video Tokenization with Channel-Split Quantization
- URL: http://arxiv.org/abs/2507.04559v1
- Date: Sun, 06 Jul 2025 22:23:27 GMT
- Title: MambaVideo for Discrete Video Tokenization with Channel-Split Quantization
- Authors: Dawit Mureja Argaw, Xian Liu, Joon Son Chung, Ming-Yu Liu, Fitsum Reda,
- Abstract summary: This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents.
- Score: 34.23941517563312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.
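The abstract does not spell out the mechanics, but the channel-split idea can be illustrated with a minimal sketch under standard VQ assumptions: split the encoder's latent channels into groups, vector-quantize each group against its own codebook, and concatenate the results, so the token count is unchanged while each token position carries several codebook indices. All names below (`ChannelSplitQuantizer`, `num_splits`) are illustrative, not the authors' API; straight-through gradients and commitment losses are omitted.

```python
# Hypothetical sketch of channel-split quantization; names and details are
# illustrative assumptions, not the paper's implementation.
import torch

class ChannelSplitQuantizer(torch.nn.Module):
    def __init__(self, channels: int, num_splits: int, codebook_size: int):
        super().__init__()
        assert channels % num_splits == 0
        self.num_splits = num_splits
        # One codebook per channel group (a shared codebook is another option).
        self.codebooks = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(codebook_size, channels // num_splits))
             for _ in range(num_splits)]
        )

    def forward(self, z: torch.Tensor):
        # z: (batch, num_tokens, channels) continuous latents from the encoder.
        quantized, indices = [], []
        for group, codebook in zip(z.chunk(self.num_splits, dim=-1), self.codebooks):
            # Nearest codebook entry for each token's channel group.
            dist = torch.cdist(group, codebook.unsqueeze(0).expand(group.size(0), -1, -1))
            idx = dist.argmin(dim=-1)           # (batch, num_tokens)
            quantized.append(codebook[idx])     # (batch, num_tokens, channels/num_splits)
            indices.append(idx)
        # Token count is unchanged; each position now holds num_splits codes.
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)

z = torch.randn(2, 16, 8)                                   # toy latents
vq = ChannelSplitQuantizer(channels=8, num_splits=2, codebook_size=512)
z_q, codes = vq(z)                                          # codes: (2, 16, 2)
```

With `num_splits=2` and a codebook of 512, each position effectively indexes 512^2 combinations without adding tokens, which is one plausible reading of "enhanced representational power while preserving the token count."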
Related papers
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers [31.95947876513405]
We present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$.
arXiv Detail & Related papers (2025-05-28T09:33:52Z)
- Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [50.214593234229255]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [63.89280381800457]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
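The dimension-wise strategy is concrete enough to sketch; the following is a minimal illustration assuming uniform bins over a clipped range (TokenBridge's actual binning may differ): each feature dimension is discretized independently, so a D-dimensional latent becomes D small integer codes rather than one index into a large codebook.

```python
# A minimal sketch of dimension-wise quantization under assumed uniform bins;
# not the paper's actual scheme.
import torch

def quantize_per_dimension(z: torch.Tensor, num_bins: int = 64,
                           lo: float = -3.0, hi: float = 3.0):
    step = (hi - lo) / num_bins
    idx = ((z.clamp(lo, hi) - lo) / step).long().clamp(max=num_bins - 1)
    z_hat = lo + (idx.float() + 0.5) * step   # dequantize to bin centers
    return idx, z_hat

z = torch.randn(4, 16)            # 4 tokens, 16 feature dimensions
idx, z_hat = quantize_per_dimension(z)
assert idx.shape == z.shape       # one discrete code per feature dimension
```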
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation [0.0]
We present a conditional encoder designed to adapt an image-to-video model for large-motion frame interpolation. To enhance performance, we integrate a dual-branch feature extractor and propose a cross-frame attention mechanism. Our approach demonstrates superior performance on the Fréchet Video Distance metric when evaluated against other state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-22T14:49:55Z)
- VidTok: A Versatile and Open-Source Video Tokenizer [24.018360305535307]
VidTok is a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenization. By integrating these advancements, VidTok achieves substantial improvements over existing methods.
arXiv Detail & Related papers (2024-12-17T16:27:11Z)
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing. First, we introduce a 3D Inverted Vector-Quantization Variational Autoencoder. Second, we present MotionAura, a text-to-video generation framework. Third, we propose a spectral transformer-based denoising network. Fourth, we introduce the downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
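The even distribution of features over clusters described above is commonly obtained with Sinkhorn-Knopp iterations (as in SwAV-style methods); the sketch below assumes that formulation, with the temperature, iteration count, and all names chosen for illustration rather than taken from the paper.

```python
# A rough sketch of Sinkhorn-Knopp iterations that spread feature-to-cluster
# assignments evenly; an assumed formulation, not SIGMA's exact one.
import torch

def sinkhorn_assign(scores: torch.Tensor, n_iters: int = 3, eps: float = 0.05):
    # scores: (num_features, num_clusters) similarity logits.
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()                                # joint distribution over all cells
    n, k = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / k     # equal total mass per cluster
        Q = Q / Q.sum(dim=1, keepdim=True) / n     # one unit of mass per feature
    return Q * n                                   # rows sum to 1: soft assignments

feats = torch.nn.functional.normalize(torch.randn(128, 32), dim=1)   # tube features
protos = torch.nn.functional.normalize(torch.randn(16, 32), dim=1)   # cluster prototypes
Q = sinkhorn_assign(feats @ protos.T)
print(Q.sum(dim=0))   # roughly n/k = 8 units of mass per cluster
```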
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- ResQ: Residual Quantization for Video Perception [18.491197847596283]
We propose a novel quantization scheme for video networks, coined Residual Quantization.
We extend our model to dynamically adjust the bit-width in proportion to the amount of change in the video.
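A rough sketch of the residual idea as the summary states it: quantize only the change from the previous frame, spending more bits when the change is large. Thresholds, bit-widths, and the function names are assumptions for illustration, not ResQ's actual scheme.

```python
# Illustrative frame-residual quantization with change-adaptive bit-width;
# all constants and names are assumed for the sketch.
import torch

def uniform_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    levels = 2 ** bits - 1
    scale = x.abs().max().clamp(min=1e-8) / levels
    return torch.round(x / scale) * scale

def quantize_video(frames: torch.Tensor, low_bits=2, high_bits=8, thresh=0.1):
    prev = uniform_quantize(frames[0], high_bits)   # anchor frame at high precision
    out = [prev]
    for frame in frames[1:]:
        residual = frame - prev
        # Spend more bits only when the frame changed substantially.
        bits = high_bits if residual.abs().mean() > thresh else low_bits
        prev = prev + uniform_quantize(residual, bits)
        out.append(prev)
    return torch.stack(out)

video = torch.rand(8, 3, 32, 32)                    # 8 toy frames
recon = quantize_video(video)
print((video - recon).abs().mean())                 # coarse reconstruction error
```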
arXiv Detail & Related papers (2023-08-18T12:41:10Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)