GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- URL: http://arxiv.org/abs/2507.01006v5
- Date: Fri, 15 Aug 2025 13:23:40 GMT
- Title: GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Authors: GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
- Abstract summary: We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs). GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size. The smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks.
- Score: 116.80874591718323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models and more information are released at https://github.com/zai-org/GLM-V.
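The abstract names Reinforcement Learning with Curriculum Sampling (RLCS) without detailing it. As a rough, non-authoritative illustration of the curriculum-sampling idea, the sketch below weights rollout tasks by a rolling pass rate so that training concentrates on problems the current policy solves about half the time. All names (`CurriculumSampler`, `window`, the weighting rule) are hypothetical and not taken from the GLM-V codebase.

```python
import random
from collections import defaultdict

class CurriculumSampler:
    """Toy curriculum sampler: prefer tasks whose rolling pass rate is
    near 0.5, i.e. neither already solved nor hopeless for the current
    policy. A hypothetical simplification, not the paper's RLCS recipe."""

    def __init__(self, task_ids, window=50):
        self.task_ids = list(task_ids)
        self.window = window                      # rolling-window length
        self.history = defaultdict(list)          # task_id -> recent 0/1 outcomes

    def _pass_rate(self, task_id):
        h = self.history[task_id]
        return sum(h) / len(h) if h else 0.5      # unseen tasks count as "medium"

    def sample_batch(self, batch_size):
        # The weight p*(1-p) peaks at pass rate 0.5, where rollouts carry
        # the most learning signal; the floor keeps every task reachable.
        weights = [max(1e-3, p * (1 - p))
                   for p in (self._pass_rate(t) for t in self.task_ids)]
        return random.choices(self.task_ids, weights=weights, k=batch_size)

    def record(self, task_id, solved):
        # Called after each rollout with the verifier's pass/fail outcome.
        h = self.history[task_id]
        h.append(1 if solved else 0)
        if len(h) > self.window:
            h.pop(0)
```

In a full RL loop, `sample_batch` would feed the rollout stage and `record` would be called with each rollout's verifier outcome, so the sampling distribution tracks the policy as it improves.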
Related papers
- MiMo-VL Technical Report [73.47820531501678]
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks and scores 59.4 on OlympiadBench. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G.
arXiv Detail & Related papers (2025-06-04T04:32:54Z)
- IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs [36.76252153495239]
IV-Bench is the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks.
arXiv Detail & Related papers (2025-04-21T19:53:44Z)
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning [76.82159851648711]
We propose a framework that dynamically improves the embedding model's representation learning for negative pairs. LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance. LLaVE also generalizes to text-video retrieval tasks in a zero-shot manner and achieves strong performance.
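The summary gives only the name of the technique. Purely as an assumed illustration of hardness-weighted contrastive learning (not LLaVE's published loss), the sketch below up-weights harder in-batch negatives, those most similar to the query, inside an InfoNCE-style objective. The function name and the `alpha` sharpness parameter are hypothetical.

```python
import torch

def hardness_weighted_infonce(q, d, tau=0.05, alpha=1.0):
    """Hypothetical hardness-weighted InfoNCE over in-batch negatives.

    q, d: (B, D) L2-normalized query/document embeddings with B >= 2;
    q[i] pairs with d[i]. Harder negatives (more similar to the query)
    receive larger weight in the denominator, controlled by `alpha`.
    """
    sim = (q @ d.t()) / tau                                   # (B, B) logits
    B = sim.size(0)
    diag = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg_logits = sim.masked_fill(diag, float("-inf"))         # mask out positives

    with torch.no_grad():                                     # weights carry no gradient
        w = torch.softmax(alpha * neg_logits, dim=1) * (B - 1)  # mean weight ~ 1

    pos = sim.diagonal()                                      # (B,) positive logits
    # log( sum_j w_ij * exp(neg_ij) ); masked entries contribute nothing
    neg = torch.logsumexp(neg_logits + torch.log(w.clamp_min(1e-12)), dim=1)
    # -log( exp(pos) / (exp(pos) + weighted negative mass) ), averaged over batch
    return (torch.logsumexp(torch.stack([pos, neg], dim=1), dim=1) - pos).mean()
```

Detaching the weights under `torch.no_grad()` keeps the hardness term from contributing gradients of its own, a common design choice in reweighted contrastive losses.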
arXiv Detail & Related papers (2025-03-04T10:21:57Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce MIRB, a multi-image benchmark designed to evaluate visual language models' ability to compare, analyze, and reason across multiple images.
Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning.
We demonstrate that while open-source VLMs approach GPT-4V on single-image tasks, a significant gap remains on multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM).
It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
arXiv Detail & Related papers (2024-04-25T17:59:19Z)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability to process arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)
- Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance [35.38549845444575]
This work proposes Chain-of-Thought Hub, an open-source evaluation suite that measures the multi-step reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-05-26T23:46:42Z)
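To make the notion of a multi-step reasoning evaluation suite concrete, here is a minimal accuracy harness in the chain-of-thought style. It is a generic sketch: `model_fn`, the dataset schema, and the answer-extraction regex are assumptions for illustration, not Chain-of-Thought Hub's actual interface.

```python
import re

def evaluate_cot(model_fn, dataset):
    """Score exact-match accuracy with chain-of-thought prompting.

    Assumed interfaces: `model_fn(prompt) -> completion string`;
    `dataset` is a list of {"question": str, "answer": str} dicts.
    """
    correct = 0
    for ex in dataset:
        # Elicit step-by-step reasoning before the final answer.
        prompt = ex["question"] + "\nLet's think step by step."
        completion = model_fn(prompt)
        # Convention: the last number in the completion is the final answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        pred = numbers[-1] if numbers else None
        correct += int(pred == ex["answer"])
    return correct / max(1, len(dataset))
```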