InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
- URL: http://arxiv.org/abs/2404.06512v1
- Date: Tue, 9 Apr 2024 17:59:32 GMT
- Title: InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
- Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
- Abstract summary: InternLM-XComposer2-4KHD is a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond.
This research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration.
Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements.
- Score: 129.9919468062788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks. The InternLM-XComposer2-4KHD model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
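To picture the dynamic resolution scheme the abstract describes, here is a minimal sketch of automatic patch configuration. This is an illustration under stated assumptions, not the paper's released implementation: the function names (`auto_patch_config`, `split_into_tiles`) and the `max_patches=55` per-image patch budget are hypothetical choices; only the 336-pixel tile size follows the pre-trained ViT mentioned in the abstract.

```python
import math

import numpy as np
from PIL import Image


def auto_patch_config(height, width, vit_size=336, max_patches=55):
    """Choose a rows x cols grid of vit_size tiles that roughly preserves
    the image's aspect ratio while staying within a patch budget."""
    # How many tiles the image would need at native resolution.
    native = (height / vit_size) * (width / vit_size)
    # Downscale uniformly if the native tiling exceeds the budget.
    scale = min(1.0, math.sqrt(max_patches / native))
    rows = max(1, math.ceil(height * scale / vit_size))
    cols = max(1, math.ceil(width * scale / vit_size))
    # Rounding up can overshoot the budget; trim the longer axis.
    while rows * cols > max_patches:
        if rows >= cols and rows > 1:
            rows -= 1
        elif cols > 1:
            cols -= 1
    return rows, cols


def split_into_tiles(image, rows, cols, vit_size=336):
    """Resize the image to the chosen grid and cut it into ViT inputs."""
    resized = image.resize((cols * vit_size, rows * vit_size), Image.BICUBIC)
    arr = np.asarray(resized)
    return [arr[r * vit_size:(r + 1) * vit_size,
                c * vit_size:(c + 1) * vit_size]
            for r in range(rows) for c in range(cols)]


# Example: a 3840 x 1600 (4K HD) image under the assumed 55-patch budget.
print(auto_patch_config(1600, 3840))  # -> (5, 11)
```

Because the grid is recomputed per image, an icon-sized input resolves to a single 336-pixel tile while a 4K screenshot fans out to dozens of tiles, which is how one training recipe can span the whole 336-pixel-to-4K range.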
Related papers
- AIM 2024 Challenge on Efficient Video Super-Resolution for AV1 Compressed Content [56.552444900457395]
Video super-resolution (VSR) is a critical task for enhancing low-bitrate and low-resolution videos, particularly in streaming applications.
In this work, we compile different methods to address these challenges; the solutions are end-to-end, real-time video super-resolution frameworks.
The proposed solutions tackle video upscaling for two applications: 540p to 4K (x4) as a general case, and 360p to 1080p (x3), more tailored towards mobile devices.
arXiv Detail & Related papers (2024-09-25T18:12:19Z)
- Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey [116.29700317843043]
This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution Challenge.
It aims to upscale compressed images from 540p to 4K resolution in real-time on commercial GPUs.
We use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography.
arXiv Detail & Related papers (2024-04-25T10:12:42Z)
- ViTAR: Vision Transformer with Any Resolution [80.95324692984903]
Vision Transformers experience a performance decline when processing resolutions different from those seen during training.
We introduce fuzzy positional encoding in the Vision Transformer to provide consistent positional awareness across multiple resolutions (a rough sketch of this idea appears after this list).
Our resulting model, ViTAR, demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution.
arXiv Detail & Related papers (2024-03-27T08:53:13Z)
- 4K4D: Real-Time 4D View Synthesis at 4K Resolution [86.6582179227016]
This paper targets high-fidelity and real-time view synthesis of dynamic 3D scenes at 4K resolution.
We propose a 4D point cloud representation that supports hardware rasterization and enables unprecedented rendering speed.
Our representation can be rendered at over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU.
arXiv Detail & Related papers (2023-10-17T17:57:38Z)
- Towards Efficient SDRTV-to-HDRTV by Learning from Image Formation [51.26219245226384]
Modern displays are capable of rendering video content with high dynamic range (HDR) and wide color gamut (WCG).
The majority of available resources are still in standard dynamic range (SDR).
We define and analyze the SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content.
Our method is primarily designed for ultra-high-definition TV content and is therefore effective and lightweight for processing 4K resolution images.
arXiv Detail & Related papers (2023-09-08T02:50:54Z)
- Super-Resolution Appearance Transfer for 4D Human Performances [29.361342747786164]
A common problem in the 4D reconstruction of people from multi-view video is the quality of the captured dynamic texture appearance.
We propose a solution through super-resolution appearance transfer from a static high-resolution appearance capture rig.
arXiv Detail & Related papers (2021-08-31T10:53:11Z)
- Collapsible Linear Blocks for Super-Efficient Super Resolution [3.5554418329811557]
Single Image Super Resolution (SISR) has become an important computer vision problem.
We propose SESR, a new class of Super-Efficient Super Resolution networks.
Detailed experiments across six benchmark datasets demonstrate that SESR achieves similar or better image quality than state-of-the-art methods.
arXiv Detail & Related papers (2021-03-17T02:16:31Z)
- ORStereo: Occlusion-Aware Recurrent Stereo Matching for 4K-Resolution Images [13.508624751092654]
We present the Occlusion-aware Recurrent binocular Stereo matching (ORStereo) model.
ORStereo generalizes to unseen high-resolution images with large disparity ranges by formulating the task as residual updates and refinements of an initial prediction.
We test the model's capability on both synthetic and real-world high-resolution images.
arXiv Detail & Related papers (2021-03-13T21:46:06Z)
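For the fuzzy positional encoding mentioned in the ViTAR entry above, the sketch below shows one way the idea can be realized; it is an illustration under stated assumptions, not ViTAR's actual implementation. The class name, the additive row/column embedding tables, the `max_size` bound, and the interpolation helper are all hypothetical; what it shares with the paper's description is jittering token coordinates with random noise during training so positional awareness stays consistent across resolutions.

```python
import torch


def _interp(table: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Linearly interpolate rows of an embedding table at fractional coords."""
    coords = coords.clamp(0, table.size(0) - 1)
    lo = coords.floor().long()
    hi = coords.ceil().long()
    frac = (coords - lo.float()).unsqueeze(-1)
    return table[lo] * (1 - frac) + table[hi] * frac


class FuzzyPositionalEncoding(torch.nn.Module):
    """Learnable 2-D positional encoding whose grid coordinates are jittered
    with uniform noise in [-0.5, 0.5) at training time, so tokens never see
    exactly the same positions twice and the model tolerates new grid sizes."""

    def __init__(self, dim: int, max_size: int = 128):
        super().__init__()
        self.row_embed = torch.nn.Parameter(torch.randn(max_size, dim) * 0.02)
        self.col_embed = torch.nn.Parameter(torch.randn(max_size, dim) * 0.02)

    def forward(self, h: int, w: int) -> torch.Tensor:
        rows = torch.arange(h, dtype=torch.float32)
        cols = torch.arange(w, dtype=torch.float32)
        if self.training:
            rows = rows + (torch.rand(h) - 0.5)  # jitter row coordinates
            cols = cols + (torch.rand(w) - 0.5)  # jitter column coordinates
        pe_rows = _interp(self.row_embed, rows)  # (h, dim)
        pe_cols = _interp(self.col_embed, cols)  # (w, dim)
        # Combine into one encoding per token on the h x w grid.
        return (pe_rows[:, None, :] + pe_cols[None, :, :]).reshape(h * w, -1)


# Example: encodings for a 24 x 24 token grid (e.g. 336 px / 14 px patches).
pe = FuzzyPositionalEncoding(dim=64)
print(pe(24, 24).shape)  # torch.Size([576, 64])
```

At inference the jitter is disabled, so the same interpolation path simply evaluates the tables at integer coordinates for whatever grid size the input resolution produces.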