VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models
- URL: http://arxiv.org/abs/2410.11665v1
- Date: Tue, 15 Oct 2024 14:49:19 GMT
- Title: VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models
- Authors: Zihang Li, Haowen Hou
- Abstract summary: We present VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs.
Both models support resolutions up to 4096 x 4096 pixels, offering a more detailed and comprehensive visual processing capability.
- Score: 1.03590082373586
- Abstract: Accurately understanding complex visual information is crucial for visual language models (VLMs). Enhancing image resolution can improve visual perception capabilities, not only reducing hallucinations but also boosting performance in tasks that demand high resolution, such as text-rich or document analysis. In this paper, we present VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs. For VisualRWKV-HD, we developed a lossless downsampling method to effectively integrate a high-resolution vision encoder with low-resolution encoders, without extending the input sequence length. For the VisualRWKV-UHD model, we enhanced image representation by dividing the image into four segments, which are then recombined with the original image. This technique allows the model to incorporate both high-resolution and low-resolution features, effectively balancing coarse and fine-grained information. As a result, the model supports resolutions up to 4096 x 4096 pixels, offering a more detailed and comprehensive visual processing capability. Both VisualRWKV-HD and VisualRWKV-UHD not only achieve strong results on VLM benchmarks but also show marked improvements in performance for text-rich tasks.
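The segment-and-recombine idea described for VisualRWKV-UHD can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the original image is resized or how the segments are recombined, so the 2x2 averaging in `downscale_by_two` and the ordering returned by `build_views` are hypothetical choices, as are all function names.

```python
from typing import List

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def split_into_quadrants(image: Image) -> List[Image]:
    """Return the four segments: top-left, top-right, bottom-left, bottom-right."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    mid_y, mid_x = height // 2, width // 2
    top, bottom = image[:mid_y], image[mid_y:]
    return [
        [row[:mid_x] for row in top],
        [row[mid_x:] for row in top],
        [row[:mid_x] for row in bottom],
        [row[mid_x:] for row in bottom],
    ]


def downscale_by_two(image: Image) -> Image:
    """Average each 2x2 block (assumed stand-in for the model's actual resizing)."""
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) // 4
            for x in range(0, len(image[0]), 2)
        ]
        for y in range(0, len(image), 2)
    ]


def build_views(image: Image) -> List[Image]:
    """Coarse global view first, then the four fine-grained segments."""
    return [downscale_by_two(image)] + split_into_quadrants(image)


# A 4x4 toy image whose pixel values are 0..15 in row-major order.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
views = build_views(image)
```

In this sketch all five views share the same height and width, so a single fixed-resolution encoder could process the coarse view and the four detailed segments alike; whether the paper does exactly this is not stated in the abstract.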
Related papers
- FlashVideo:Flowing Fidelity to Detail for Efficient High-Resolution Video Generation [61.61415607972597]
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale.
High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs).
We propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality.
arXiv Detail & Related papers (2025-02-07T18:59:59Z)
- Exploring Linear Attention Alternative for Single Image Super-Resolution [28.267177967085143]
Deep learning-based single-image super-resolution (SISR) technology focuses on enhancing low-resolution (LR) images into high-resolution (HR) ones.
We present a novel approach that combines the Receptance Weighted Key Value (RWKV) architecture with feature extraction techniques.
Under the 4x Super-Resolution tasks, compared to the MambaIR model, we achieved an average improvement of 0.26% in PSNR and 0.16% in SSIM.
arXiv Detail & Related papers (2025-02-01T11:39:02Z)
- Elevating Flow-Guided Video Inpainting with Reference Generation [50.03502211226332]
Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video.
We propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm.
Our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts.
arXiv Detail & Related papers (2024-12-12T06:13:00Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
- ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models [126.35334860896373]
We investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes.
Existing approaches to higher-resolution generation, such as attention-based and joint-diffusion methods, cannot adequately address these issues.
We propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference.
arXiv Detail & Related papers (2023-10-11T17:52:39Z)
- Super-Resolution Appearance Transfer for 4D Human Performances [29.361342747786164]
A common problem in the 4D reconstruction of people from multi-view video is the quality of the captured dynamic texture appearance.
We propose a solution through super-resolution appearance transfer from a static high-resolution appearance capture rig.
arXiv Detail & Related papers (2021-08-31T10:53:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.