MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
- URL: http://arxiv.org/abs/2602.24222v1
- Date: Fri, 27 Feb 2026 17:48:54 GMT
- Title: MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
- Authors: Albert Dominguez Mantes, Gioele La Manno, Martin Weigert
- Abstract summary: We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines.
- Score: 1.9116784879310027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
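The abstract's core mechanism, extending rotary positional embeddings from integer token indices to continuous shared world coordinates, can be illustrated with a short sketch. This is a generic 2-D rotary-embedding implementation over physical coordinates, not the authors' released code; the function name, feature dimensions, frequency base, and micrometre units are illustrative assumptions.

```python
import numpy as np

def rope_2d(x, coords, base=100.0):
    """Rotate token features by angles derived from continuous 2-D coordinates.

    x:      (n_tokens, d) token features, d divisible by 4
    coords: (n_tokens, 2) world coordinates (e.g. micrometres), shared
            across all resolution levels of the same image
    """
    n, d = x.shape
    half = d // 2                                      # one half per spatial axis
    freqs = base ** (-np.arange(0, half, 2) / half)    # (half // 2,) frequencies
    out = np.empty_like(x)
    for axis in (0, 1):                                # rotate each axis's half
        ang = coords[:, axis:axis + 1] * freqs         # (n, half // 2) angles
        cos, sin = np.cos(ang), np.sin(ang)
        seg = x[:, axis * half:(axis + 1) * half].reshape(n, -1, 2)
        rot = np.stack([seg[..., 0] * cos - seg[..., 1] * sin,
                        seg[..., 0] * sin + seg[..., 1] * cos], axis=-1)
        out[:, axis * half:(axis + 1) * half] = rot.reshape(n, half)
    return out

# Patches from two magnifications of the same slide: because coordinates
# are physical (micrometres), a wide-field patch and the high-resolution
# patches it contains land at consistent positions in one encoder.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8))
coords_low  = np.array([[0.0, 0.0], [64.0, 0.0], [0.0, 64.0]])  # coarse patches
coords_high = np.array([[0.0, 0.0], [16.0, 0.0], [0.0, 16.0]])  # fine patches
q_rot = rope_2d(q, np.vstack([coords_low, coords_high]))
```

The usual rotary property carries over: the dot product of two rotated tokens depends only on the difference of their world coordinates, so attention becomes translation-invariant in physical space regardless of which resolution level each patch came from.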
Related papers
- Uni-AIMS: AI-Powered Microscopy Image Analysis [28.24402780080126]
We develop a data engine that generates high-quality annotated datasets. We propose a segmentation model capable of robustly detecting both small and large objects. Our solution supports the precise automatic recognition of image scale bars.
arXiv Detail & Related papers (2025-05-11T09:35:53Z)
- Mesoscopic Insights: Orchestrating Multi-scale & Hybrid Architecture for Image Manipulation Localization [45.99713338249702]
The mesoscopic level serves as a bridge between the macroscopic and microscopic worlds, addressing gaps overlooked by both. Inspired by this, our paper explores how to simultaneously construct mesoscopic representations of micro and macro information for IML. Our models surpass the current state-of-the-art in terms of performance, computational complexity, and robustness.
arXiv Detail & Related papers (2024-12-18T11:43:41Z)
- ZoomLDM: Latent Diffusion Model for multi-scale image generation [57.639937071834986]
We present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings. ZoomLDM synthesizes coherent histopathology images that remain contextually accurate and detailed at different zoom levels.
arXiv Detail & Related papers (2024-11-25T22:39:22Z)
- CViT: Continuous Vision Transformer for Operator Learning [24.1795082775376]
Continuous Vision Transformer (CViT) is a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems. CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. We demonstrate CViT's effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes.
arXiv Detail & Related papers (2024-05-22T21:13:23Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged as a task for high-precision object segmentation in high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete global localization and local refinement.
Inspired by this, we model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- Multi-Spectral Image Stitching via Spatial Graph Reasoning [52.27796682972484]
We propose a spatial graph reasoning based multi-spectral image stitching method.
We embed multi-scale complementary features from the same view position into a set of nodes.
By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features.
arXiv Detail & Related papers (2023-07-31T15:04:52Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Increasing a microscope's effective field of view via overlapped imaging and machine learning [4.23935174235373]
This work demonstrates a multi-lens microscopic imaging system that overlaps multiple independent fields of view on a single sensor for high-efficiency automated specimen analysis.
arXiv Detail & Related papers (2021-10-10T22:52:36Z)
- LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [52.63874513999119]
Cross-resolution image alignment is a key problem in multiscale gigapixel photography.
Existing deep homography methods neglect the explicit formulation of correspondences between the inputs, which leads to degraded accuracy in cross-resolution challenges.
We propose a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs.
arXiv Detail & Related papers (2021-06-08T02:51:45Z)
- Global Voxel Transformer Networks for Augmented Microscopy [54.730707387866076]
We introduce global voxel transformer networks (GVTNets), an advanced deep learning tool for augmented microscopy.
GVTNets are built on global voxel transformer operators (GVTOs), which are able to aggregate global information.
We apply the proposed methods on existing datasets for three different augmented microscopy tasks under various settings.
arXiv Detail & Related papers (2020-08-05T20:11:15Z)
- Multi-element microscope optimization by a learned sensing network with composite physical layers [3.2435888122704037]
Digital microscopes are used to capture images for automated interpretation by computer algorithms.
In this work, we investigate an approach to jointly optimize multiple microscope settings, together with a classification network.
We show that the network's resulting low-resolution microscope images (20X-comparable) offer a machine learning network sufficient contrast to match the classification performance of corresponding high-resolution imagery.
arXiv Detail & Related papers (2020-06-27T16:49:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.