InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
- URL: http://arxiv.org/abs/2512.16975v1
- Date: Thu, 18 Dec 2025 17:13:59 GMT
- Title: InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
- Authors: Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu,
- Abstract summary: Current tokenizers rigidly compress all content at a fixed rate, leading to redundancy or information loss. This paper introduces InfoTok, a principled framework for adaptive video tokenization. We develop a transformer-based compressor that enables adaptive tokenization.
- Score: 114.03378443007074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without degrading performance and achieving 2.3x compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compact yet accurate tokenization for video representation, offering valuable insights for future research.
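The core idea of allocating tokens "according to informational richness" can be illustrated with a minimal sketch. The snippet below is NOT the paper's ELBO-based algorithm; it is a toy proportional allocator (function name `allocate_tokens` and the per-frame entropy inputs are illustrative assumptions) showing how a fixed token budget might be distributed across frames by estimated information content:

```python
def allocate_tokens(frame_entropies, total_budget, min_tokens=1):
    """Distribute a token budget across frames proportionally to their
    estimated information content. Illustrative only -- InfoTok's actual
    method is learned via an ELBO objective, not this heuristic."""
    total_entropy = sum(frame_entropies)
    n = len(frame_entropies)
    if total_entropy == 0:
        # Degenerate case: no information estimates, spread evenly.
        return [total_budget // n] * n
    # Proportional allocation, floored at min_tokens per frame.
    alloc = [max(min_tokens, int(total_budget * h / total_entropy))
             for h in frame_entropies]
    # Trim any overshoot from the richest frames so the budget holds.
    while sum(alloc) > total_budget:
        i = max(range(n), key=lambda j: alloc[j])
        alloc[i] -= 1
    return alloc

# A near-static scene (low entropy) receives few tokens,
# while information-dense frames receive many.
print(allocate_tokens([0.2, 0.2, 3.1, 2.9], total_budget=32))
```

A fixed-rate tokenizer would instead assign 8 tokens to every frame here, wasting budget on the static frames; adaptive allocation is what lets InfoTok-style methods save tokens without losing information.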
Related papers
- Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning [32.030660835097926]
CaCoVID is a novel token selection algorithm for efficient video understanding. First, we introduce a reinforcement learning-based framework that trains a policy network to select the video token combinations with the greatest contribution to correct predictions. Second, we propose a policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations.
arXiv Detail & Related papers (2026-02-02T05:09:48Z) - UniComp: Rethinking Video Compression Through Informational Uniqueness [16.98296446798904]
UniComp aims to maximize the information fidelity of video representations under constrained computational budgets. We introduce the notion of information uniqueness to measure intrinsic redundancy among tokens and link it to reconstruction error.
arXiv Detail & Related papers (2025-12-03T08:56:23Z) - VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction [55.66673587952058]
Video understanding models are increasingly limited by the prohibitive storage and computational costs of large-scale datasets. VideoCompressa is a novel framework for video data synthesis that reframes the problem as dynamic latent compression.
arXiv Detail & Related papers (2025-11-24T07:07:58Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - Embedding Compression Distortion in Video Coding for Machines [67.97469042910855]
Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. We propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representations and embeds them into downstream models. Our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in execution time and number of parameters.
arXiv Detail & Related papers (2025-03-27T13:01:53Z) - Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval [16.497758750494537]
We propose an efficient video representation network with Differentiable Resolution Compression and Alignment mechanism.
We leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features.
We introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions.
arXiv Detail & Related papers (2023-09-15T05:31:53Z) - Learned Video Compression via Heterogeneous Deformable Compensation Network [78.72508633457392]
We propose a learned video compression framework via a heterogeneous deformable compensation strategy (HDCVC) to tackle the problem of unstable compression performance.
More specifically, the proposed algorithm extracts features from the two adjacent frames to estimate content-neighborhood heterogeneous deformable (HetDeform) kernel offsets.
Experimental results indicate that HDCVC achieves superior performance compared with recent state-of-the-art learned video compression approaches.
arXiv Detail & Related papers (2022-07-11T02:31:31Z) - High-Efficiency Lossy Image Coding Through Adaptive Neighborhood Information Aggregation [37.02522504535854]
Lossy image coding (LIC) with superior efficiency in both compression performance and throughput is challenging.
Our method reports superior compression performance, surpassing VVC Intra with an approximately 15% BD-rate improvement averaged across the Kodak, CLIC, and Tecnick datasets.
arXiv Detail & Related papers (2022-04-25T05:40:57Z) - Content Adaptive and Error Propagation Aware Deep Video Compression [110.31693187153084]
We propose a content adaptive and error propagation aware video compression system.
Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame.
Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.