Related papers: VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

URL: http://arxiv.org/abs/2505.12053v1
Date: Sat, 17 May 2025 15:32:54 GMT
Title: VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption
Authors: Tianxiong Zhong, Xingye Tian, Boyuan Jiang, Xuebo Wang, Xin Tao, Pengfei Wan, Zhiwei Zhang,
Abstract summary: Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate.<n>The paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding.<n> VFRTok achieves competitive reconstruction quality and state-of-the-art video fidelity while using only 1/8 tokens compared to existing tokenizers.
Score: 19.263984982252396
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to the duration rather than the number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding through asymmetric frame rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling, which groups correlated patches into unified tokens. The Partial RoPE effectively improves content-awareness, enhancing the video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 tokens compared to existing tokenizers.

Related papers

VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate [16.826081397057774]
VGDFR is a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate.<n>We show that VGDFR can achieve a speedup up to 3x for video generation with minimal quality degradation.
arXiv Detail & Related papers (2025-04-16T17:09:13Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
New tokenizer, Diffusion Conditioned-based Gene Tokenizer, replaces GAN-based decoder with conditional diffusion model.<n>We trained using only a basic MSE diffusion loss for reconstruction, along with KL term and LPIPS perceptual loss from scratch.<n>Even a scaled-down version of CDT (3$times inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation [16.216254819711327]
We propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that can make use of adaptive temporal compression in latent space.<n>Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models.
arXiv Detail & Related papers (2025-02-17T15:22:31Z)
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion [116.40704026922671]
First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation.<n>We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
arXiv Detail & Related papers (2025-01-15T18:59:15Z)
Improved Video VAE for Latent Video Diffusion Model [55.818110540710215]
Video Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora. Most of existing VAEs inflate a pretrained image VAE into the 3D causal structure for temporal-spatial compression. We propose a new KTC architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE)
arXiv Detail & Related papers (2024-11-10T12:43:38Z)
High-Efficiency Neural Video Compression via Hierarchical Predictive Learning [27.41398149573729]
Enhanced Deep Hierarchical Video Compression-DHVC 2.0- introduces superior compression performance and impressive complexity efficiency. Uses hierarchical predictive coding to transform each video frame into multiscale representations. Supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss.
arXiv Detail & Related papers (2024-10-03T15:40:58Z)
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter [77.0205013713008]
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained vision models. We propose a sparse-andcorrelated AdaPter (RAP) to fine-tune the pre-trained model with a few parameterized layers.
arXiv Detail & Related papers (2024-05-29T19:23:53Z)
Boosting Neural Representations for Videos with a Conditional Decoder [28.073607937396552]
Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing. This paper introduces a universal boosting framework for current implicit video representation approaches.
arXiv Detail & Related papers (2024-02-28T08:32:19Z)
A Codec Information Assisted Framework for Efficient Compressed Video Super-Resolution [15.690562510147766]
Video Super-Resolution (VSR) using recurrent neural network architecture is a promising solution due to its efficient modeling of long-range temporal dependencies. We propose a Codec Information Assisted Framework (CIAF) to boost and accelerate recurrent VSR models for compressed videos.
arXiv Detail & Related papers (2022-10-15T08:48:29Z)
Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
FlowGuided Sparse Transformer (F GST) is a framework for video deblurring. FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames. Our proposed F GST outperforms state-of-the-art patches on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
End-to-End Learning for Video Frame Compression with Self-Attention [25.23586503813838]
We propose an end-to-end learned system for compressing video frames. Our system learns deep embeddings of frames and encodes their difference in latent space. In our experiments, we show that the proposed system achieves high compression rates and high objective visual quality.
arXiv Detail & Related papers (2020-04-20T12:11:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.