Scaling Vision Pre-Training to 4K Resolution
- URL: http://arxiv.org/abs/2503.19903v1
- Date: Tue, 25 Mar 2025 17:58:37 GMT
- Title: Scaling Vision Pre-Training to 4K Resolution
- Authors: Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
- Abstract summary: We introduce PS3, which scales vision pre-training to 4K resolution with a near-constant cost. PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions. VILA-HD significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training.
- Score: 120.32767371797578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representations, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 can both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When PS3 is applied to a multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training, such as AnyRes and S^2, while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than the latest token pruning approaches. Finally, we find that current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and a 2.96x speedup over Qwen2-VL.
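The core mechanism in the abstract, a cheap global pass at low resolution followed by high-resolution encoding of only the most salient or prompt-relevant regions, can be sketched in a few lines. The snippet below is a minimal toy interpretation, not PS3's actual architecture or API; selective_encode, vision_encoder, and region_scorer are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def selective_encode(image_4k, vision_encoder, region_scorer,
                     low_res=378, patch=378, top_k=4):
    """image_4k: (3, H, W) tensor in [0, 1]; returns global + selected local features."""
    # 1) Cheap global pass at low resolution (the "global image at low resolution").
    low = F.interpolate(image_4k[None], size=(low_res, low_res),
                        mode="bilinear", align_corners=False)
    global_feat = vision_encoder(low)                              # (1, D)

    # 2) Tile the full-resolution image into local crops.
    _, H, W = image_4k.shape
    crops, coords = [], []
    for top in range(0, H - patch + 1, patch):
        for left in range(0, W - patch + 1, patch):
            crops.append(image_4k[:, top:top + patch, left:left + patch])
            coords.append((top, left))
    crops = torch.stack(crops)                                     # (N, 3, p, p)

    # 3) Score regions (saliency or relevance to a prompt) and encode only the top-k.
    scores = region_scorer(crops, global_feat)                     # (N,)
    keep = scores.topk(min(top_k, scores.numel())).indices
    local_feats = vision_encoder(crops[keep])                      # (k, D)
    return global_feat, local_feats, [coords[int(i)] for i in keep]

# Toy usage with stand-in modules (real encoders would be ViT-style networks).
encoder = lambda x: F.adaptive_avg_pool2d(x, 1).flatten(1)         # (B, 3) "features"
scorer = lambda crops, g: crops.std(dim=(1, 2, 3))                 # variance as saliency
image = torch.rand(3, 2160, 3840)                                  # a 4K frame
global_feat, local_feats, regions = selective_encode(image, encoder, scorer)
```

The point of the sketch is the cost profile: the expensive high-resolution pass touches only top_k crops regardless of input resolution, which is the near-constant cost the abstract claims.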
Related papers
- AIM 2024 Challenge on Efficient Video Super-Resolution for AV1 Compressed Content [56.552444900457395]
Video super-resolution (VSR) is a critical task for enhancing low-bitrate and low-resolution videos, particularly in streaming applications.
In this work, we compile different methods to address these challenges; the solutions are end-to-end, real-time video super-resolution frameworks.
The proposed solutions tackle video up-scaling for two applications: 540p to 4K (x4) as a general case, and 360p to 1080p (x3), which is more tailored towards mobile devices.
arXiv Detail & Related papers (2024-09-25T18:12:19Z) - InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD [129.9919468062788]
InternLM-XComposer2-4KHD is a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond.
This research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration.
Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements.
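One plausible reading of "dynamic resolution with automatic patch configuration" is choosing how many fixed-size tiles to cut an image into, subject to a compute budget. The sketch below illustrates that reading only; it is not the paper's published algorithm, and the tile size of 336 pixels and the budget of 55 tiles are illustrative assumptions.

```python
import math

def auto_patch_config(img_w, img_h, tile=336, max_tiles=55):
    """Pick a rows x cols grid of fixed-size tiles, shrinking the image until it fits."""
    scale = 1.0
    while True:
        cols = math.ceil(img_w * scale / tile)
        rows = math.ceil(img_h * scale / tile)
        if rows * cols <= max_tiles:
            return rows, cols, scale
        scale *= 0.9  # downscale slightly and retry

print(auto_patch_config(3840, 1600))  # -> (5, 11, 0.9): slight downscale to fit the budget
```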
arXiv Detail & Related papers (2024-04-09T17:59:32Z) - LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images [119.24323184581974]
We present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution.
Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks.
arXiv Detail & Related papers (2024-03-18T12:04:11Z) - Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras [65.54875149514274]
We present the first approach to render highly realistic free-viewpoint videos of a human actor in general apparel.
At inference, our method only requires four camera views of the moving actor and the respective 3D skeletal pose.
It handles actors in wide clothing, and reproduces even fine-scale dynamic detail.
arXiv Detail & Related papers (2023-12-12T16:45:52Z) - 4K4D: Real-Time 4D View Synthesis at 4K Resolution [86.6582179227016]
This paper targets high-fidelity and real-time view synthesis of dynamic 3D scenes at 4K resolution.
We propose a 4D point cloud representation that supports hardware rasterization and enables unprecedented rendering speed.
Our representation can be rendered at over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU.
arXiv Detail & Related papers (2023-10-17T17:57:38Z) - 4K-HAZE: A Dehazing Benchmark with 4K Resolution Hazy and Haze-Free Images [12.402054374952485]
We develop a novel method to simulate 4K hazy images from clear images, which first estimates the scene depth, simulates the light rays and object reflectance, then migrates the synthetic images to real domains by using a GAN.
We wrap these synthesized images into a benchmark called the 4K-HAZE dataset.
The most appealing aspect of our approach is the capability to run a 4K image on a single GPU with 24 GB of RAM in real time (33 fps).
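Depth-driven haze simulation of this kind is usually built on the standard atmospheric scattering model, I = J * t + A * (1 - t) with transmission t = exp(-beta * d). The sketch below shows only that physics prior as background; the 4K-HAZE pipeline additionally simulates light rays and object reflectance and uses a GAN for synthetic-to-real transfer, none of which is reproduced here.

```python
import numpy as np

def synthesize_haze(clear_rgb, depth, beta=1.2, airlight=0.9):
    """clear_rgb: (H, W, 3) floats in [0, 1]; depth: (H, W) normalized scene depth."""
    t = np.exp(-beta * depth)[..., None]          # per-pixel transmission map
    return clear_rgb * t + airlight * (1.0 - t)   # hazy image, still in [0, 1]

# Toy usage: a flat gray image whose depth increases toward the top of the frame.
clear = np.full((540, 960, 3), 0.5, dtype=np.float32)
depth = np.linspace(1.0, 0.0, 540, dtype=np.float32)[:, None].repeat(960, axis=1)
hazy = synthesize_haze(clear, depth)
```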
arXiv Detail & Related papers (2023-03-28T09:39:29Z) - Swin Transformer V2: Scaling Up Capacity and Resolution [45.462916348268664]
We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536 x 1,536 resolution.
By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks.
arXiv Detail & Related papers (2021-11-18T18:59:33Z) - ORStereo: Occlusion-Aware Recurrent Stereo Matching for 4K-Resolution Images [13.508624751092654]
We present the Occlusion-aware Recurrent binocular Stereo matching (ORStereo) model.
ORStereo generalizes to unseen high-resolution images with large disparity ranges by formulating the task as residual updates and refinements of an initial prediction.
We test the model's capability on both synthetic and real-world high-resolution images.
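The residual-update formulation generalizes to unseen resolutions and disparity ranges because the network only learns corrections to an upsampled initial prediction rather than absolute disparities. The loop below illustrates that idea; it is not ORStereo's actual architecture, and refine_disparity and residual_net are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def refine_disparity(coarse_disp, left_img, residual_net, steps=3):
    """coarse_disp: (B, 1, h, w) low-res disparity; left_img: (B, 3, H, W) full resolution."""
    H, W = left_img.shape[-2:]
    # Upsample the coarse prediction and rescale disparity values to the new width.
    disp = F.interpolate(coarse_disp, size=(H, W), mode="bilinear",
                         align_corners=False) * (W / coarse_disp.shape[-1])
    for _ in range(steps):
        delta = residual_net(torch.cat([disp, left_img], dim=1))  # (B, 1, H, W)
        disp = disp + delta                                        # residual update
    return disp

# Toy usage with a stand-in residual predictor.
net = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)
coarse = torch.rand(1, 1, 128, 160)
left = torch.rand(1, 3, 1024, 1280)
full_disp = refine_disparity(coarse, left, net)
```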
arXiv Detail & Related papers (2021-03-13T21:46:06Z)