Related papers: ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

URL: http://arxiv.org/abs/2505.16495v1
Date: Thu, 22 May 2025 10:26:51 GMT
Title: ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation
Authors: Lingfeng Wang, Hualing Lin, Senda Chen, Tao Wang, Changxu Cheng, Yangyang Zhong, Dong Zheng, Wuyue Zhao,
Abstract summary: We propose ALTo, an adaptive length tokenizer for autoregressive mask generation. A novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost.
Score: 6.618610740026886
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences on the trade-off between mask quality and efficiency are implemented via group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at https://github.com/yayafengzi/ALToLLM.

Related papers

DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration [12.172968576254469]
We introduce a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models.
arXiv Detail & Related papers (2025-06-06T20:24:36Z)
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation [80.90309237362526]
TokLIP is a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics.
arXiv Detail & Related papers (2025-05-08T17:12:19Z)
Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [50.214593234229255]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved significant success in various tasks. The main computational burden arises from processing text and visual tokens. We propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok). SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. Alongside TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution. This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
Emerging Property of Masked Token for Effective Pre-training [15.846621577804791]
Masked Image Modeling (MIM) has been instrumental in driving recent breakthroughs in computer vision. MIM's overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase. We propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens.
arXiv Detail & Related papers (2024-04-12T08:46:53Z)
Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs. Their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging. We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training. We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.