Qwen2.5-VL Technical Report
- URL: http://arxiv.org/abs/2502.13923v1
- Date: Wed, 19 Feb 2025 18:00:14 GMT
- Title: Qwen2.5-VL Technical Report
- Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin,
- Abstract summary: Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition.
It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts.
Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing.
- Score: 57.43576033343722
- License:
- Abstract: We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
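To make the dynamic-resolution and absolute-time ideas in the abstract concrete, here is a minimal sketch. The patch size, merge factor, token budget, and all function names are assumptions chosen for illustration; this is not the released Qwen2.5-VL implementation.
```python
# Illustrative sketch only: how a native dynamic-resolution pipeline can map
# images of arbitrary size to a variable number of visual tokens, and how
# absolute time encoding can tie video positions to wall-clock seconds.
# PATCH, MERGE, and MAX_TOKENS are assumed constants for this example.
import math

PATCH = 14          # assumed ViT patch edge in pixels
MERGE = 2           # assumed 2x2 spatial merge before the language model
MAX_TOKENS = 16384  # assumed per-image visual-token budget

def visual_token_grid(height: int, width: int):
    """Snap an image to patch-aligned dimensions and count its visual tokens."""
    unit = PATCH * MERGE
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    tokens = (h // PATCH) * (w // PATCH) // (MERGE * MERGE)
    if tokens > MAX_TOKENS:                      # too large: rescale and retry
        scale = math.sqrt(MAX_TOKENS / tokens)
        return visual_token_grid(int(height * scale), int(width * scale))
    return h, w, tokens

def absolute_time_ids(timestamps_s, tick_s=1.0):
    """Map frame timestamps (seconds) to temporal position ids so that one id
    step always corresponds to a fixed real-time interval, regardless of fps."""
    return [int(t / tick_s) for t in timestamps_s]

print(visual_token_grid(1080, 1920))               # a 1080p video frame
print(visual_token_grid(3508, 2480))               # an A4 page scanned at 300 dpi
print(absolute_time_ids([0.0, 0.5, 1.0, 7200.0]))  # frames from a 2-hour video
```
Under a scheme like this, a short clip and a multi-hour video share the same temporal scale, which is the property the abstract credits for second-level event localization without per-video normalization.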
Related papers
- Qwen2.5 Technical Report [122.13958993185952]
We introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs.
Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages.
Open-weight offerings include base and instruction-tuned models, with quantized versions available.
For hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus.
arXiv Detail & Related papers (2024-12-19T17:56:09Z)
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding [39.14141055325595]
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models.
For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios.
For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism.
arXiv Detail & Related papers (2024-12-13T17:37:48Z)
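The DeepSeek-VL2 entry above mentions a dynamic tiling vision encoding strategy. The toy sketch below illustrates the general idea of choosing a tile grid that roughly preserves the image's aspect ratio; the tile size, tile cap, and function names are assumptions, not DeepSeek-VL2's actual code.
```python
# Toy illustration of dynamic tiling (not DeepSeek-VL2's implementation):
# pick a tile grid matching the image's aspect ratio, then resize the image
# so it divides evenly into fixed-size tiles for the vision encoder.
TILE = 384        # assumed tile edge fed to the vision encoder
MAX_TILES = 12    # assumed upper bound on tiles per image

def choose_tile_grid(height: int, width: int):
    """Pick (rows, cols) whose ratio best matches the image, within the cap."""
    target = height / width
    best, best_err = (1, 1), float("inf")
    for rows in range(1, MAX_TILES + 1):
        for cols in range(1, MAX_TILES // rows + 1):
            err = abs(rows / cols - target)
            if err < best_err:
                best, best_err = (rows, cols), err
    return best

rows, cols = choose_tile_grid(1080, 1920)   # wide 16:9 image
print(rows, cols, "-> resize to", rows * TILE, "x", cols * TILE)
```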
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
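The Qwen2-VL entry above mentions Multimodal Rotary Position Embedding (M-RoPE). A minimal sketch of the position-id bookkeeping it implies follows, assuming each token carries a (temporal, height, width) id triple; the layout and function names are illustrative assumptions, not the paper's code.
```python
# Toy sketch of multimodal 3D position ids in the spirit of M-RoPE:
# every token gets a (temporal, height, width) id. Text tokens advance all
# three components together; image/video tokens index a spatio-temporal grid.
def text_position_ids(start: int, num_tokens: int):
    """Text behaves like ordinary 1D RoPE: all three components are equal."""
    return [(start + i, start + i, start + i) for i in range(num_tokens)]

def vision_position_ids(start: int, frames: int, grid_h: int, grid_w: int):
    """Vision tokens share one temporal id per frame and index a 2D grid."""
    ids = []
    for t in range(frames):
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((start + t, start + h, start + w))
    return ids

# Example: 3 text tokens followed by a 2-frame clip on a 2x2 token grid.
ids = text_position_ids(0, 3)
ids += vision_position_ids(start=3, frames=2, grid_h=2, grid_w=2)
print(ids[:3])   # [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
print(ids[3:7])  # first frame: [(3, 3, 3), (3, 3, 4), (3, 4, 3), (3, 4, 4)]
```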
- Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities [11.53488611812612]
Recent advancements in Vision-Language (VL) models have sparked interest in their deployment on edge devices.
We introduce EdgeVL, a novel framework that seamlessly integrates dual-modality knowledge distillation and quantization-aware contrastive learning.
Our work represents the first systematic effort to adapt large VL models for edge deployment, showcasing up to 15.4% accuracy improvements on multiple datasets and up to 93-fold reduction in model size.
arXiv Detail & Related papers (2024-03-07T21:34:40Z)
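The edge-deployment (EdgeVL) entry above mentions quantization-aware contrastive learning. One plausible reading is a contrastive objective computed on fake-quantized features, sketched below; the names, loss form, and bit width are assumptions for illustration, not the paper's implementation.
```python
# Hypothetical illustration of quantization-aware contrastive learning:
# embeddings are fake-quantized (quantize then dequantize) before an InfoNCE
# loss, so the encoder learns features that survive low-bit inference.
import numpy as np

def fake_quantize(x: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor fake quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-8
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quant_aware_infonce(a: np.ndarray, b: np.ndarray, temp: float = 0.07) -> float:
    """InfoNCE over matched rows of two embedding matrices after quantization."""
    a = fake_quantize(a); a /= np.linalg.norm(a, axis=1, keepdims=True)
    b = fake_quantize(b); b /= np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temp                        # (N, N) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # diagonal pairs are positives

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 32)), rng.normal(size=(8, 32))
print(quant_aware_infonce(img, txt))
```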
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [72.41822115096741]
We introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs).
We endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus.
The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales.
arXiv Detail & Related papers (2023-08-24T17:59:17Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)