Related papers: InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

URL: http://arxiv.org/abs/2602.01554v1
Date: Mon, 02 Feb 2026 02:47:48 GMT
Title: InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
Authors: Lv Tang, Tianyi Zheng, Bo Li, Xingyu Li
Abstract summary: Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework. We introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism.
Score: 29.96158942341168
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.
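The trade-off the abstract describes can be written in its textbook form. The abstract does not state InfoTok's exact objective, so the block below is only the classical Information Bottleneck Lagrangian, offered as a sketch. The symbols are assumptions introduced here for illustration: X is the input image, Z the shared visual tokens, Y the multimodal outputs (understanding and generation targets), and β > 0 a trade-off weight.

```latex
% Classical Information Bottleneck Lagrangian (generic form, not
% necessarily the loss used by InfoTok). Assumed symbols:
%   X = input image, Z = shared visual tokens, Y = multimodal outputs,
%   \beta > 0 = weight on task relevance.
\begin{equation}
  \min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
    = \underbrace{I(X; Z)}_{\text{compression}}
    \;-\; \beta \, \underbrace{I(Z; Y)}_{\text{task relevance}}
\end{equation}
```

Under this reading, minimizing I(X; Z) discourages tokens from spending capacity on high-entropy detail and redundancy, while the β-weighted I(Z; Y) term rewards keeping the structure that the downstream tasks can reuse.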

Related papers

Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
Quantizing Text-attributed Graphs for Semantic-Structural Integration [6.721504414917793]
Text-attributed graphs (TAGs) have emerged as a powerful representation for modeling complex relationships across diverse domains. With the rise of large language models (LLMs), there is growing interest in leveraging their capabilities for graph learning. We propose STAG, a novel self-supervised framework that directly quantizes graph structural information into discrete tokens using a frozen codebook.
arXiv Detail & Related papers (2025-07-20T09:18:02Z)
Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach [55.861432910722186]
UniToCom is a unified token communication paradigm that treats tokens as the fundamental units for both processing and wireless transmission. We propose a generative information bottleneck (GenIB) principle, which facilitates the learning of tokens that preserve essential information. We employ a causal Transformer-based multimodal large language model (MLLM) at the receiver to unify the processing of both discrete and continuous tokens.
arXiv Detail & Related papers (2025-07-02T14:03:01Z)
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM [21.967692616735196]
Multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. We propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. This work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
arXiv Detail & Related papers (2025-05-23T10:43:45Z)
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference [28.24397677839652]
Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models. How MLLMs process and utilize visual information remains unclear. We propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance.
arXiv Detail & Related papers (2025-03-17T12:31:23Z)
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning [10.761218096540976]
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts. We propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing Multimodal Knowledge Graphs.
arXiv Detail & Related papers (2025-03-17T09:31:14Z)
Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs). Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv Detail & Related papers (2024-12-22T21:56:15Z)
Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to the broader adoption of MLLMs. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
ActiveMLP: An MLP-like Architecture with Active Token Mixer [54.95923719553343]
This paper presents ActiveMLP, a general MLP-like backbone for computer vision. We propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given one. In this way, the spatial range of token-mixing is expanded and the way of token-mixing is reformed.
arXiv Detail & Related papers (2022-03-11T17:29:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.