Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
- URL: http://arxiv.org/abs/2601.05927v1
- Date: Fri, 09 Jan 2026 16:41:08 GMT
- Title: Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
- Authors: Yohann Perron, Vladyslav Sydorov, Christophe Pottier, Loic Landrieu
- Abstract summary: Current approaches for segmenting ultra-high resolution images either slide a window, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness.
- Score: 12.757251643358067
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current approaches for segmenting ultra-high resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.
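The abstract describes the relay-token mechanism only at a high level. Below is a minimal PyTorch sketch of how such a cross-scale bridge could look: a few learnable tokens first attend to both branches to aggregate context, then each branch attends back to them. The module name, token count, and shared propagation attention are illustrative assumptions, not the authors' released implementation (linked above).

```python
# Illustrative sketch only: a cross-scale bridge built from learnable
# relay tokens. Names and sizes are assumptions, not the released code.
import torch
import torch.nn as nn

class RelayTokenBridge(nn.Module):
    def __init__(self, dim: int = 768, num_relay: int = 8, heads: int = 8):
        super().__init__()
        # A small set of learnable relay tokens shared by both branches.
        self.relay = nn.Parameter(torch.zeros(1, num_relay, dim))
        nn.init.trunc_normal_(self.relay, std=0.02)
        # Aggregate: relay tokens read from branch tokens.
        self.aggregate = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Propagate: branch tokens read the updated relay tokens back.
        self.propagate = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor):
        # local_tokens:  (B, N_local, dim), from high-resolution small crops
        # global_tokens: (B, N_global, dim), from the low-resolution large view
        B = local_tokens.shape[0]
        relay = self.relay.expand(B, -1, -1)
        # 1) Aggregate context from both scales into the relay tokens.
        context = torch.cat([local_tokens, global_tokens], dim=1)
        relay, _ = self.aggregate(relay, context, context)
        # 2) Propagate the pooled context back into each branch (residual).
        local_up, _ = self.propagate(local_tokens, relay, relay)
        global_up, _ = self.propagate(global_tokens, relay, relay)
        return local_tokens + local_up, global_tokens + global_up

# Example: bridge 196 local tokens and 49 global tokens with 8 relay tokens.
bridge = RelayTokenBridge(dim=768, num_relay=8)
loc, glo = torch.randn(2, 196, 768), torch.randn(2, 49, 768)
loc2, glo2 = bridge(loc, glo)  # shapes preserved: (2, 196, 768), (2, 49, 768)
```

Inserting one such bridge after selected backbone blocks keeps the added parameter count small, which would be consistent with the abstract's claim of under 2% extra parameters.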
Related papers
- FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection [9.291995455336929]
We propose a Focal Token Acquiring-and-Scaling Transformer (FASTer). FASTer condenses token sequences in an adaptive and lightweight manner. It significantly outperforms other state-of-the-art detectors in both performance and efficiency.
arXiv Detail & Related papers (2025-02-28T03:15:33Z)
- Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining [47.15857899099733]
We develop an end-to-end multi-scale Transformer to facilitate high-quality image reconstruction.
We incorporate intra-scale implicit neural representations based on pixel coordinates with the degraded inputs in a closed-loop design.
Our approach, named NeRD-Rain, performs favorably against state-of-the-art methods on both synthetic and real-world benchmark datasets.
arXiv Detail & Related papers (2024-04-02T01:18:16Z)
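As a rough illustration of the intra-scale implicit neural representation mentioned in the NeRD-Rain summary above, the sketch below maps normalized pixel coordinates to pixel values with a small MLP; the class name and layer widths are assumptions, not the paper's architecture.

```python
# Illustrative sketch only: a coordinate-based implicit neural
# representation. Class name and layer widths are assumptions.
import torch
import torch.nn as nn

class CoordinateINR(nn.Module):
    def __init__(self, hidden: int = 256, out_ch: int = 3):
        super().__init__()
        # An MLP from normalized (x, y) coordinates to pixel values.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_ch),
        )

    def forward(self, h: int, w: int) -> torch.Tensor:
        # Build an (H*W, 2) grid of coordinates in [-1, 1].
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).view(-1, 2)
        # Query the representation at every pixel and reshape to an image.
        return self.mlp(coords).view(h, w, -1)
```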
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it suitable for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Representation Separation for Semantic Segmentation with Vision Transformers [11.431694321563322]
Vision transformers (ViTs) encoding an image as a sequence of patches bring new paradigms for semantic segmentation.
We present an efficient framework of representation separation in local-patch level and global-region level for semantic segmentation with ViTs.
arXiv Detail & Related papers (2022-12-28T09:54:52Z)
- Memory transformers for full context and high-resolution 3D Medical Segmentation [76.93387214103863]
This paper introduces the Full resolutIoN mEmory (FINE) transformer to overcome this issue.
The core idea behind FINE is to learn memory tokens that indirectly model full-range interactions.
Experiments on the BCV image segmentation dataset show better performance than state-of-the-art CNN and transformer baselines.
arXiv Detail & Related papers (2022-10-11T10:11:05Z)
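The stated core idea of FINE, memory tokens that indirectly model full-range interactions, could be sketched as follows: each local window is processed independently, while a small set of shared memory tokens reads from every window and is then read back by each one. All names are hypothetical, and a single shared attention module is used in both directions for brevity.

```python
# Illustrative sketch only: shared memory tokens that let local windows
# of a large volume exchange information indirectly. Names are hypothetical.
import torch
import torch.nn as nn

class MemoryTokenLayer(nn.Module):
    def __init__(self, dim: int = 384, num_memory: int = 16, heads: int = 6):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(1, num_memory, dim))
        nn.init.trunc_normal_(self.memory, std=0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, windows: list[torch.Tensor]) -> list[torch.Tensor]:
        # windows: one (B, N_w, dim) token tensor per local window.
        B = windows[0].shape[0]
        mem = self.memory.expand(B, -1, -1)
        # Pass 1: the memory tokens read every window (full-range aggregation).
        for w in windows:
            upd, _ = self.attn(mem, w, w)
            mem = mem + upd
        # Pass 2: each window reads the memory back (indirect global context).
        return [w + self.attn(w, mem, mem)[0] for w in windows]
```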
- Illumination Adaptive Transformer [66.50045722358503]
We propose a lightweight, fast Illumination Adaptive Transformer (IAT).
IAT decomposes the light transformation pipeline into local and global ISP components.
We have extensively evaluated IAT on multiple real-world datasets.
arXiv Detail & Related papers (2022-05-30T06:21:52Z)
- Any-resolution Training for High-resolution Image Synthesis [55.19874755679901]
Generative models operate at fixed resolution, even though natural images come in a variety of sizes.
We argue that every pixel matters and create datasets with variable-size images, collected at their native resolutions.
We introduce continuous-scale training, a process that samples patches at random scales to train a new generator with variable output resolutions.
arXiv Detail & Related papers (2022-04-14T17:59:31Z)
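A minimal sketch of what the continuous-scale patch sampling described above could look like: draw a random scale, crop a correspondingly sized region from the native-resolution image, and resize it to a fixed training size. The function name and scale bounds are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch only (assumes Pillow): sample a training patch at a
# random continuous scale from an image kept at its native resolution.
import random
from PIL import Image

def sample_patch(img: Image.Image, out_size: int = 256,
                 min_scale: float = 0.25, max_scale: float = 1.0) -> Image.Image:
    """Return an out_size x out_size patch seen at a random scale."""
    scale = random.uniform(min_scale, max_scale)
    # A smaller scale means a larger native crop, i.e. a coarser view.
    crop = min(int(out_size / scale), img.width, img.height)
    x = random.randint(0, img.width - crop)
    y = random.randint(0, img.height - crop)
    patch = img.crop((x, y, x + crop, y + crop))
    return patch.resize((out_size, out_size), Image.BICUBIC)
```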
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy to implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- EDTER: Edge Detection with Transformer [71.83960813880843]
We propose a novel transformer-based edge detector, Edge Detection TransformER (EDTER), to extract clear and crisp object boundaries and meaningful edges.
EDTER exploits the full image context information and detailed local cues simultaneously.
Experiments on BSDS500, NYUDv2, and Multicue demonstrate the superiority of EDTER in comparison with state-of-the-art methods.
arXiv Detail & Related papers (2022-03-16T11:55:55Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
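The iterative, progressive sampling strategy summarized above could be sketched roughly as follows: token features are sampled on a regular grid, and at each iteration the tokens predict offsets that move their sampling locations toward discriminative regions. The module name, offset update rule, and step size are illustrative assumptions, not the PS-ViT reference implementation.

```python
# Illustrative sketch only: tokens sampled on a grid iteratively predict
# offsets toward discriminative regions. Names and the update rule are
# assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSampler(nn.Module):
    def __init__(self, dim: int = 256, grid: int = 7, iters: int = 4):
        super().__init__()
        self.iters = iters
        self.proj = nn.Conv2d(3, dim, kernel_size=1)  # pixels -> feature map
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.offset_head = nn.Linear(dim, 2)          # per-token (dx, dy)
        # Initial regular grid in [-1, 1] x [-1, 1] (grid_sample convention).
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid),
                                torch.linspace(-1, 1, grid), indexing="ij")
        self.register_buffer("init_pts", torch.stack([xs, ys], -1).view(1, -1, 2))

    def forward(self, img: torch.Tensor):             # img: (B, 3, H, W)
        feat = self.proj(img)
        pts = self.init_pts.expand(img.shape[0], -1, -1)  # (B, N, 2)
        for _ in range(self.iters):
            # Sample token features at the current (possibly irregular) points.
            tokens = F.grid_sample(feat, pts.unsqueeze(2), align_corners=True)
            tokens = tokens.squeeze(-1).transpose(1, 2)   # (B, N, dim)
            tokens = self.encoder(tokens)
            # Each token nudges its own sampling location for the next pass.
            pts = (pts + 0.1 * torch.tanh(self.offset_head(tokens))).clamp(-1, 1)
        return tokens, pts
```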