Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
- URL: http://arxiv.org/abs/2508.13460v1
- Date: Tue, 19 Aug 2025 02:36:44 GMT
- Title: Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
- Authors: Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin,
- Abstract summary: This paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token and visual coding.
- Score: 16.905045322159953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the same core objective: maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed visual coding field. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance the efficiency and robustness of MLLM token techniques and, conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; (3) prospect promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs simultaneously.
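The shared objective the abstract names, maximizing fidelity at minimal cost, is the classical rate-distortion trade-off. A minimal, hypothetical sketch of how it maps onto visual token compression (the importance scores, proxy distortion measure, and lambda weight are all illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, importance: np.ndarray, k: int):
    """Keep the k highest-importance visual tokens; return (kept, distortion)."""
    order = np.argsort(importance)[::-1]        # most important first
    kept_idx = np.sort(order[:k])               # preserve original token order
    kept = tokens[kept_idx]
    # Distortion proxy: mean squared magnitude of the discarded tokens.
    dropped = np.delete(tokens, kept_idx, axis=0)
    distortion = float((dropped ** 2).mean()) if dropped.size else 0.0
    return kept, distortion

def rd_objective(tokens, importance, k, lam=0.1):
    """Classical coding trade-off L = D + lambda * R, with rate R = k / N."""
    _, d = compress_tokens(tokens, importance, k)
    return d + lam * (k / len(tokens))
```

Under this view, token pruning and merging methods are choices of importance measure and rate constraint, which is the module-by-module correspondence the survey develops.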
Related papers
- CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding [24.71096142371054]
Large Language Models (LLMs) have achieved remarkable success in source code understanding.
As software systems grow in scale, computational efficiency has become a critical bottleneck.
arXiv Detail & Related papers (2026-02-02T08:10:21Z) - Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification [23.26600803714466]
"Compression Tells Intelligence" is supported by research in artificial intelligence, particularly concerning (multimodal) large language models (LLMs/MLLMs).
This paper provides a comprehensive overview of two dominant technique families -- Visual Coding and Vision Token Technology.
We experimentally show a large potential of task-oriented token developments in more practical tasks like multimodal LLMs (MLLMs), AI-generated content (AIGC), and embodied AI.
arXiv Detail & Related papers (2026-01-28T16:18:20Z) - VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models [82.05514464090172]
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding.
However, their ability to generate code from multimodal inputs remains limited.
We introduce VisCodex, a unified framework that seamlessly merges vision and coding language models.
arXiv Detail & Related papers (2025-08-13T17:00:44Z) - Omni-Video: Democratizing Unified Video Understanding and Generation [13.616454543808798]
This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing.
Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders.
To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements.
arXiv Detail & Related papers (2025-07-08T16:02:16Z) - Token Sequence Compression for Efficient Multimodal Computing [0.19116784879310028]
The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning, but at significant computational cost.
We highlight the redundancy and inefficiency in current vision encoders, and seek to construct an adaptive compression method for multimodal data.
This work is a first effort towards more effective encoding and processing of high-dimensional data, and paves the way for more scalable and sustainable multimodal systems.
arXiv Detail & Related papers (2025-04-24T19:11:10Z) - Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models [51.84752285423123]
We introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance.
We propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level.
Experimental results demonstrate that augmenting the vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs.
arXiv Detail & Related papers (2025-03-23T11:33:09Z) - LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models [9.660892239615364]
This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO.
LEO is a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling.
We show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe.
arXiv Detail & Related papers (2025-01-13T00:29:55Z) - SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.
We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding.
Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks.
However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements.
We introduce a simple yet effective method for training-free visual token compression, called VTC-compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z) - Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.41055673919895]
This study explores the design space for MLLMs using a mixture of vision encoders and resolutions.
We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z) - X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM.
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z)
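A recurring idea in the listed papers (e.g., the [CLS]-token and Eagle entries) is training-free token reduction driven by attention: the [CLS] query's attention over patch keys ranks which visual tokens to keep. A minimal sketch of that general idea, with all shapes, names, and the keep ratio being illustrative assumptions rather than any one paper's exact method:

```python
import numpy as np

def cls_attention_prune(q_cls, keys, values, keep_ratio=0.25):
    """Rank patch tokens by the [CLS] query's attention over their keys,
    then keep only the top fraction, preserving original token order."""
    d = q_cls.shape[-1]
    logits = keys @ q_cls / np.sqrt(d)          # (N,) CLS->patch scores
    attn = np.exp(logits - logits.max())        # stable softmax
    attn /= attn.sum()
    k = max(1, int(round(len(attn) * keep_ratio)))
    keep = np.sort(np.argsort(attn)[::-1][:k])  # top-k indices, sorted
    return values[keep], keep
```

Because the scores are read off an attention map the model already computes, this kind of pruning requires no retraining, which is what makes it attractive for efficient MLLM deployment.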
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.