SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization
- URL: http://arxiv.org/abs/2412.10443v2
- Date: Tue, 17 Dec 2024 03:55:34 GMT
- Title: SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization
- Authors: Zhentao Tan, Ben Xue, Jian Jia, Junhao Wang, Wencai Ye, Shaoyun Shi, Mingjie Sun, Wenjin Wu, Quan Chen, Peng Jiang
- Abstract summary: SweetTokenizer is a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm.
- Score: 20.109136454526233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTokenizer), a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm. Firstly, to obtain compact latent representations, we decouple images or videos into spatial-temporal dimensions, translating visual information into learnable spatial and temporal query tokens through a Cross-attention Query AutoEncoder (CQAE). Secondly, to complement visual information during compression, we quantize these tokens via a specialized codebook derived from off-the-shelf LLM embeddings, leveraging the rich semantics of the language modality. Finally, to enhance training stability and convergence, we also introduce a curriculum learning strategy, which proves critical for effective discrete visual representation learning. SweetTokenizer achieves comparable video reconstruction fidelity with only 25% of the tokens used in previous state-of-the-art video tokenizers, and boosts video generation results by 32.9% w.r.t. gFVD. When using the same number of tokens, it significantly improves video and image reconstruction results by 57.1% w.r.t. rFVD on UCF-101 and 37.2% w.r.t. rFID on ImageNet-1K. Additionally, the compressed tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
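As an illustration of the two mechanisms described in the abstract, the PyTorch sketch below shows how a fixed set of learnable spatial and temporal query tokens can cross-attend to patch features (a CQAE-style encoder), and how the resulting latents can be snapped to the nearest entry of a frozen codebook built from LLM embeddings. All module names, shapes, and layer choices are assumptions for illustration, not the authors' released implementation; the curriculum learning stage is omitted.

```python
# Minimal sketch (not the authors' code) of two ideas from the abstract:
# (1) learnable query tokens that cross-attend to patch features (CQAE-style),
# (2) nearest-neighbour quantization against a frozen codebook derived from
#     LLM embeddings. Sizes and layers are placeholder assumptions.
import torch
import torch.nn as nn


class CrossAttentionQueryEncoder(nn.Module):
    """Compress N patch tokens into a small set of learnable query tokens."""

    def __init__(self, dim=512, n_spatial=64, n_temporal=16, n_heads=8):
        super().__init__()
        self.spatial_queries = nn.Parameter(torch.randn(n_spatial, dim) * 0.02)
        self.temporal_queries = nn.Parameter(torch.randn(n_temporal, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):            # patch_tokens: (B, N, dim)
        b = patch_tokens.size(0)
        queries = torch.cat([self.spatial_queries, self.temporal_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(queries, patch_tokens, patch_tokens)
        return self.norm(out)                   # (B, n_spatial + n_temporal, dim)


class LLMCodebookQuantizer(nn.Module):
    """Quantize latents to their nearest entry in a frozen LLM-derived codebook."""

    def __init__(self, llm_embeddings):         # (V, dim), e.g. projected LLM rows
        super().__init__()
        self.register_buffer("codebook", llm_embeddings)

    def forward(self, z):                        # z: (B, T, dim)
        # squared L2 distance between each latent and every codebook entry
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.t()
             + self.codebook.pow(2).sum(-1))
        ids = d.argmin(dim=-1)                   # (B, T) discrete token ids
        z_q = self.codebook[ids]
        # straight-through estimator keeps gradients flowing to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, ids


if __name__ == "__main__":
    enc = CrossAttentionQueryEncoder()
    quant = LLMCodebookQuantizer(torch.randn(32000, 512))  # placeholder "LLM" vocab
    patches = torch.randn(2, 1024, 512)                    # fake patch features
    z_q, ids = quant(enc(patches))
    print(z_q.shape, ids.shape)                            # (2, 80, 512), (2, 80)
```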
Related papers
- Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models [50.214593234229255]
We introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens.
We propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence.
Experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%.
arXiv Detail & Related papers (2025-03-21T09:46:31Z) - HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks.
It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - DAST: Context-Aware Compression in LLMs via Dynamic Allocation of Soft Tokens [20.044306399439265]
Large Language Models (LLMs) face computational inefficiencies and redundant processing when handling long context inputs.
We propose Dynamic Allocation of Soft Tokens (DAST), a simple yet effective method that leverages the LLM's intrinsic understanding of contextual relevance to guide compression.
Experimental results across multiple benchmarks demonstrate that DAST surpasses state-of-the-art methods.
arXiv Detail & Related papers (2025-02-17T06:55:13Z) - SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference [45.11612407862277]
In vision-language models (VLMs), visual tokens usually account for a significant share of the computational overhead.
We propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs.
Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks.
arXiv Detail & Related papers (2024-10-06T09:18:04Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens "skip layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation [95.29102596532854]
A tokenizer serves as a translator, mapping intricate visual data into a compact latent space.
This paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization.
arXiv Detail & Related papers (2024-06-13T17:59:26Z) - Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework.
We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z) - Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z) - Vision Transformer with Sparse Scan Prior [57.37893387775829]
Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention (S³A) mechanism.
This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors.
Building on S³A, we introduce the Sparse Scan Vision Transformer.
arXiv Detail & Related papers (2024-05-22T04:34:36Z) - Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation [21.815083817914843]
We propose a new zero-shot video-to-video translation framework, named LatentWarp.
Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space.
Experiment results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.
arXiv Detail & Related papers (2023-11-01T08:02:57Z) - Context Compression for Auto-regressive Transformers with Sentinel Tokens [37.07722536907739]
We propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones.
Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach.
arXiv Detail & Related papers (2023-10-12T09:18:19Z) - CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z) - MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation [31.100954335785026]
We advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve 76.5% J&F using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
arXiv Detail & Related papers (2021-11-29T16:01:28Z) - VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.