Transcending the Limit of Local Window: Advanced Super-Resolution
Transformer with Adaptive Token Dictionary
- URL: http://arxiv.org/abs/2401.08209v2
- Date: Thu, 18 Jan 2024 07:59:03 GMT
- Title: Transcending the Limit of Local Window: Advanced Super-Resolution
Transformer with Adaptive Token Dictionary
- Authors: Leheng Zhang, Yawei Li, Xingyu Zhou, Xiaorui Zhao, Shuhang Gu
- Abstract summary: Single Image Super-Resolution is a classic computer vision problem that involves estimating high-resolution (HR) images from low-resolution (LR) ones.
We introduce a group of auxiliary Adaptive Token Dictionaries to the SR Transformer and establish the ATD-SR method.
Our method achieves the best performance on various single image super-resolution benchmarks.
- Score: 30.506135273928596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single Image Super-Resolution is a classic computer vision problem that
involves estimating high-resolution (HR) images from low-resolution (LR) ones.
Although deep neural networks (DNNs), especially Transformers for
super-resolution, have seen significant advancements in recent years,
challenges still remain, particularly the limited receptive field imposed by
window-based self-attention. To address these issues, we introduce a group of
auxiliary Adaptive Token Dictionaries to the SR Transformer and establish the
ATD-SR method. The introduced token dictionaries learn prior information from
the training data and adapt the learned prior to the specific test image through
an adaptive refinement step. The refinement strategy not only provides global
information to all input tokens but also groups image tokens into categories.
Based on category partitions, we further propose a category-based
self-attention mechanism designed to leverage distant but similar tokens for
enhancing input features. The experimental results show that our method
achieves the best performance on various single image super-resolution
benchmarks.
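The category-based attention described above can be illustrated with a minimal sketch: tokens are assigned to categories by their similarity to dictionary entries, then self-attention runs only among tokens sharing a category. All names (`category_self_attention`, `dictionary`, the nearest-entry grouping rule, the single-head form without projections) are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def category_self_attention(tokens, dictionary):
    """tokens: (N, d) image tokens; dictionary: (M, d) learned entries.

    1) Assign each token to its most similar dictionary entry (category).
    2) Run self-attention only among tokens sharing a category, so
       distant but similar tokens can still exchange information.
    """
    n, d = tokens.shape
    # Step 1: category assignment via token-dictionary similarity.
    categories = (tokens @ dictionary.T).argmax(axis=1)
    out = np.zeros_like(tokens)
    # Step 2: per-category self-attention (single head, no projections).
    for c in np.unique(categories):
        idx = np.where(categories == c)[0]
        sub = tokens[idx]                          # tokens in category c
        attn = softmax(sub @ sub.T / np.sqrt(d))   # attention within category
        out[idx] = attn @ sub
    return out, categories

rng = np.random.default_rng(0)
x, dct = rng.normal(size=(16, 8)), rng.normal(size=(4, 8))
y, cats = category_self_attention(x, dct)
```

Note that the attention here is restricted by category rather than by spatial window, which is how the approach sidesteps the local-window receptive-field limit.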
Related papers
- ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration [27.622615148357994]
We propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration. We exploit the category information embedded in the TDCA attention maps to group input features into multiple categories. ATD and its lightweight version, ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks.
arXiv Detail & Related papers (2026-03-03T03:56:09Z) - From Local Windows to Adaptive Candidates via Individualized Exploratory: Rethinking Attention for Image Super-Resolution [20.444907448992154]
Single Image Super-Resolution (SISR) is a fundamental computer vision task that aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input. We propose the Individualized Exploratory Transformer (IET) to enable flexible and token-adaptive attention computation.
arXiv Detail & Related papers (2026-01-13T09:01:20Z) - ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution [71.69364653858447]
Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. We propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying complexities using different numbers of vision tokens. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities.
arXiv Detail & Related papers (2025-10-14T17:58:10Z) - AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [58.67129770371016]
We propose a novel IRSTD framework that reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy.
arXiv Detail & Related papers (2025-05-21T07:02:05Z) - Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation [35.50570174431677]
We propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions.
We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations.
Our model, supported by novel loss functions, captures a broader range of information, enriches feature representation, improves discriminative ability, and enhances generalization across different resolutions.
arXiv Detail & Related papers (2025-04-26T08:44:04Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformers.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images.
We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration.
Our method outperforms existing high-resolution strategies on four datasets using the same data.
arXiv Detail & Related papers (2025-03-10T17:51:16Z) - Context-Based Visual-Language Place Recognition [4.737519767218666]
A popular approach to vision-based place recognition relies on low-level visual features.
We introduce a novel VPR approach that remains robust to scene changes and does not require additional training.
Our method constructs semantic image descriptors by extracting pixel-level embeddings using a zero-shot, language-driven semantic segmentation model.
arXiv Detail & Related papers (2024-10-25T06:59:11Z) - SSA-Seg: Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation [11.176993272867396]
In this paper, we propose a novel Semantic and Spatial Adaptive classifier (SSA-Seg) to address the challenges of semantic segmentation.
Specifically, we employ the coarse masks obtained from the fixed prototypes as a guide to adjust the fixed prototype towards the center of the semantic and spatial domains in the test image.
Results show that the proposed SSA-Seg significantly improves the segmentation performance of the baseline models with only a minimal increase in computational cost.
arXiv Detail & Related papers (2024-05-10T15:14:23Z) - Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object
Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z) - Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
arXiv Detail & Related papers (2023-11-22T11:10:45Z) - Low-Resolution Self-Attention for Semantic Segmentation [96.81482872022237]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
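The LRSA idea summarized above can be sketched briefly: queries stay at full resolution while keys and values are average-pooled to a fixed low-resolution grid, so attention cost no longer grows quadratically with input size. The function name, the fixed `pool` size, and the single-head projection-free form are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lrsa(feat, pool=4):
    """feat: (H, W, d) feature map; keys/values pooled to (pool, pool).

    H and W are assumed divisible by `pool` for this sketch.
    """
    h, w, d = feat.shape
    q = feat.reshape(h * w, d)                 # full-resolution queries
    # Average-pool the map to a fixed pool x pool grid for keys/values.
    kv = feat.reshape(pool, h // pool, pool, w // pool, d).mean(axis=(1, 3))
    kv = kv.reshape(pool * pool, d)
    attn = softmax(q @ kv.T / np.sqrt(d))      # (H*W, pool*pool) attention
    return (attn @ kv).reshape(h, w, d)

y = lrsa(np.random.default_rng(1).normal(size=(8, 8, 16)))
```

Because the key/value grid has a constant size, each query attends to a fixed number of positions regardless of the input resolution, which is the source of the reduced computational cost.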
arXiv Detail & Related papers (2023-10-08T06:10:09Z) - Learning Resolution-Adaptive Representations for Cross-Resolution Person
Re-Identification [49.57112924976762]
Cross-resolution person re-identification problem aims to match low-resolution (LR) query identity images against high resolution (HR) gallery images.
It is a challenging and practical problem since the query images often suffer from resolution degradation due to the different capturing conditions from real-world cameras.
This paper explores an alternative SR-free paradigm to directly compare HR and LR images via a dynamic metric, which is adaptive to the resolution of a query image.
arXiv Detail & Related papers (2022-07-09T03:49:51Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - SRWarp: Generalized Image Super-Resolution under Arbitrary
Transformation [65.88321755969677]
Deep CNNs have achieved significant successes in image processing and its applications, including single image super-resolution.
Recent approaches extend the scope to real-valued upsampling factors.
We propose the SRWarp framework to further generalize the SR tasks toward an arbitrary image transformation.
arXiv Detail & Related papers (2021-04-21T02:50:41Z) - PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of
Generative Models [77.32079593577821]
PULSE (Photo Upsampling via Latent Space Exploration) generates high-resolution, realistic images at resolutions previously unseen in the literature.
Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.
arXiv Detail & Related papers (2020-03-08T16:44:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.