Related papers: Kelix Technical Report

Kelix Technical Report

URL: http://arxiv.org/abs/2602.09843v3
Date: Thu, 12 Feb 2026 15:36:05 GMT
Title: Kelix Technical Report
Authors: Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang,
Abstract summary: We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.<n>Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling.
Score: 86.64551727600104
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.

Related papers

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer [50.69959748410398]
We introduce MingTok, a new family of visual tokenizers with a continuous latent space for unified autoregressive generation and understanding.<n>MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction.<n>Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregrsssive prediction paradigm.
arXiv Detail & Related papers (2025-10-08T02:50:14Z)
Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
Bridge is a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability.<n>We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
arXiv Detail & Related papers (2025-10-02T00:40:02Z)
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations [33.11867433769496]
This paper presents a framework that attempts to unify visual understanding and generation within a shared semantic representation.<n>At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary.<n> Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.
arXiv Detail & Related papers (2025-06-23T17:59:14Z)
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [85.82112629564942]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens.<n>We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism.<n>Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video)<n>We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model Semantic through discrete semantic.<n>Our method improved the understanding performance by 4.8% compared to the previous SOTA Emu3 and surpassed the dedicated understanding model LLaVA-NeXT 34B by 3.7%.
arXiv Detail & Related papers (2024-11-26T03:33:52Z)
Continually Learn to Map Visual Concepts to Large Language Models in Resource-constrained Environments [7.481446615819558]
Continual Visual Mapping (CVM) is an approach that continually ground vision representations to a knowledge space extracted from a fixed Language model. CVM overcomes state-of-the-art continual learning methods on five benchmarks.
arXiv Detail & Related papers (2024-07-11T08:28:40Z)
Improve Variational Autoencoder for Text Generationwith Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning. VAEs tend to ignore latent variables with a strong auto-regressive decoder. We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.