Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
- URL: http://arxiv.org/abs/2601.19798v1
- Date: Tue, 27 Jan 2026 17:01:16 GMT
- Title: Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
- Authors: Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li,
- Abstract summary: We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform them without task-specific additions.
- Score: 79.06371915084833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
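The central VLUAS mechanism, treating visual tokens as prediction targets in the same autoregressive stream as text, can be illustrated with a short sketch. The snippet below is an assumption-laden illustration rather than code from the paper: it presumes visual inputs have already been discretized into token ids by some visual tokenizer and interleaved with text ids, and all names (unified_autoregressive_loss, visual_mask) are hypothetical.

```python
# Illustrative sketch only (not code from the paper). We assume a decoder-only
# model whose forward pass returns per-position logits, and sequences in which
# image patches have already been discretized into token ids by some visual
# tokenizer and interleaved with text token ids.
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(model, input_ids: torch.Tensor, visual_mask: torch.Tensor):
    """Next-token prediction loss over BOTH visual and text positions.

    input_ids:   (B, T) interleaved visual/text token ids
    visual_mask: (B, T) bool, True at visual-token positions
    """
    logits = model(input_ids)                    # (B, T, V) per-position logits
    shift_logits = logits[:, :-1, :]             # predict token t+1 from the prefix up to t
    shift_labels = input_ids[:, 1:].clone()
    # A text-dominant ("vision-as-input") objective would drop visual targets:
    #   shift_labels[visual_mask[:, 1:]] = -100  # excluded via ignore_index
    # VLUAS ("vision-as-target") keeps them, so visual detail is supervised too.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

In practice, visual and text positions might also be weighted differently in the loss; the sketch treats them uniformly.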
Related papers
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations. We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z) - ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z) - Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z) - Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization [42.41263928527529]
Vision-Language-Action (VLA) models can endow agents with transferable world knowledge and vision-language grounding. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original visual representations and knowledge are preserved. We conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations.
arXiv Detail & Related papers (2025-10-29T15:20:10Z) - ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model [61.29164681694533]
ViPER is a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability.
arXiv Detail & Related papers (2025-10-28T10:42:57Z) - VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set [80.50996301430108]
The alignment of vision-language representations endows current Vision-Language Models with strong multi-modal reasoning capabilities. We propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts.
arXiv Detail & Related papers (2025-10-24T10:29:31Z) - Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.48672228625821]
We introduce Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. Our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
arXiv Detail & Related papers (2025-10-13T05:51:22Z) - Visual Representation Alignment for Multimodal Large Language Models [38.319869213758686]
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, but they remain limited in vision-centric tasks such as object counting or spatial reasoning. We present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (a minimal sketch of such a regularizer appears after this list).
arXiv Detail & Related papers (2025-09-09T17:59:14Z) - UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent [14.089700378708756]
We introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives. UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method.
arXiv Detail & Related papers (2025-01-31T03:20:09Z)
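Several of the related papers above (e.g. VIRAL) regularize a VLM by aligning its internal visual features with those of a frozen vision foundation model. The following is a minimal sketch of such a regularizer under assumed shapes and names (representation_alignment_loss, proj); it is not the authors' implementation.

```python
# Illustrative sketch of a VIRAL-style alignment regularizer (assumed form, not
# the authors' implementation): penalize dissimilarity between the MLLM's hidden
# states at visual-token positions and frozen vision-foundation-model features.
import torch
import torch.nn.functional as F

def representation_alignment_loss(mllm_visual_feats: torch.Tensor,
                                  vfm_feats: torch.Tensor,
                                  proj: torch.nn.Module) -> torch.Tensor:
    """mllm_visual_feats: (B, N, D_mllm) MLLM hidden states at visual positions
    vfm_feats:            (B, N, D_vfm) frozen vision foundation model features
    proj:                 learned linear map D_mllm -> D_vfm (hypothetical)"""
    pred = F.normalize(proj(mllm_visual_feats), dim=-1)   # (B, N, D_vfm)
    target = F.normalize(vfm_feats.detach(), dim=-1)      # stop-gradient on targets
    return (1.0 - (pred * target).sum(dim=-1)).mean()     # mean cosine distance
```

Such a term would typically be added to the standard language-modeling loss with a small weighting coefficient.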
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.