MTVNet: Mapping using Transformers for Volumes -- Network for Super-Resolution with Long-Range Interactions
- URL: http://arxiv.org/abs/2412.03379v2
- Date: Mon, 09 Dec 2024 10:06:22 GMT
- Title: MTVNet: Mapping using Transformers for Volumes -- Network for Super-Resolution with Long-Range Interactions
- Authors: August Leander Høeg, Sophia W. Bardenfleth, Hans Martin Kjer, Tim B. Dyrby, Vedrana Andersen Dahl, Anders Dahl
- Abstract summary: It has been difficult for volumetric super-resolution to utilize the recent advances in transformer-based models seen in 2D super-resolution.
To overcome this, we propose a multi-scale transformer model built from hierarchical attention blocks combined with carrier tokens at multiple scales.
We experimentally compare our method, MTVNet, against state-of-the-art volumetric super-resolution models on five 3D datasets.
- Score: 4.0602274934844615
- Abstract: Until now, it has been difficult for volumetric super-resolution to utilize the recent advances in transformer-based models seen in 2D super-resolution. The memory required for self-attention in 3D volumes limits the receptive field. Therefore, long-range interactions are not used in 3D to the extent done in 2D, and the strength of transformers is not realized. To overcome this, we propose a multi-scale transformer model built from hierarchical attention blocks combined with carrier tokens at multiple scales. Here, information from larger regions at coarse resolution is sequentially carried on to finer-resolution regions to predict the super-resolved image. Using transformer layers at each resolution, our coarse-to-fine modeling limits the number of tokens at each scale and enables attention over larger regions than has previously been possible. We experimentally compare our method, MTVNet, against state-of-the-art volumetric super-resolution models on five 3D datasets, demonstrating the advantage of an increased receptive field. This advantage is especially pronounced for images that are larger than those in commonly used 3D datasets. Our code is available at https://github.com/AugustHoeg/MTVNet
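As a rough illustration of the coarse-to-fine idea described in the abstract, the sketch below runs attention over a limited set of tokens per level and hands a few carrier tokens from the coarse level down to the finer one. The module and parameter names (LevelBlock, n_carrier) are illustrative assumptions, not the official MTVNet implementation; see the linked repository for the actual code.

```python
# Minimal sketch of coarse-to-fine attention with carrier tokens (illustrative,
# not the official MTVNet implementation). Each level attends over its own
# tokens plus a few carrier tokens handed down from the coarser level, so the
# token count per attention call stays small while the context stays large.
import torch
import torch.nn as nn


class LevelBlock(nn.Module):
    def __init__(self, dim, n_heads=4, n_carrier=8):
        super().__init__()
        self.n_carrier = n_carrier
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.to_carrier = nn.Linear(dim, dim)  # summarize this level for the next

    def forward(self, tokens, carrier=None):
        # tokens: (B, N, C) tokens of the current resolution level
        # carrier: (B, K, C) context tokens carried from the coarser level
        if carrier is not None:
            tokens = torch.cat([carrier, tokens], dim=1)
        x = self.norm(tokens)
        x = tokens + self.attn(x, x, x, need_weights=False)[0]
        if carrier is not None:
            x = x[:, carrier.shape[1]:]          # drop the incoming carriers
        # produce new carrier tokens: pooled summaries of this level
        pooled = x.mean(dim=1, keepdim=True).expand(-1, self.n_carrier, -1)
        return x, self.to_carrier(pooled)


if __name__ == "__main__":
    B, C = 2, 64
    coarse = torch.randn(B, 27, C)    # e.g. 3x3x3 patches at coarse resolution
    fine = torch.randn(B, 216, C)     # e.g. 6x6x6 patches at fine resolution
    lvl1, lvl2 = LevelBlock(C), LevelBlock(C)
    x1, carrier = lvl1(coarse)                 # attention over the large region
    x2, _ = lvl2(fine, carrier=carrier)        # fine level sees coarse context
    print(x2.shape)                            # torch.Size([2, 216, 64])
```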
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
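A hedged sketch of one way such a 3D-to-2D virtual projection could look: each 3D point is projected orthographically onto a few axis-aligned virtual planes, and the resulting 2D coordinates sample a 2D positional-embedding grid from the source modality. The function name and the choice of views are assumptions for illustration; Any2Point's actual projection and encoding may differ.

```python
# Hedged sketch of a 3D-to-2D virtual projection: each 3D point is projected
# orthographically onto a few virtual view planes, and the 2D coordinates are
# used to sample a 2D positional-embedding grid (details differ from Any2Point).
import torch
import torch.nn.functional as F

def project_points_to_views(points, pos_embed_2d):
    """points: (B, N, 3) in [-1, 1]; pos_embed_2d: (C, H, W) learned 2D grid."""
    B, N, _ = points.shape
    # three axis-aligned virtual views: drop x, y, or z respectively
    views = [points[..., [1, 2]], points[..., [0, 2]], points[..., [0, 1]]]
    feats = []
    for uv in views:                                   # uv: (B, N, 2) in [-1, 1]
        grid = uv.reshape(B, N, 1, 2)                  # grid_sample expects (B, H, W, 2)
        emb = F.grid_sample(
            pos_embed_2d.expand(B, -1, -1, -1),        # (B, C, H, W)
            grid, align_corners=True)                  # -> (B, C, N, 1)
        feats.append(emb.squeeze(-1).transpose(1, 2))  # (B, N, C)
    return torch.stack(feats, dim=0).mean(0)           # average over views

pts = torch.rand(2, 1024, 3) * 2 - 1                   # toy point cloud
pe = torch.randn(64, 16, 16)                           # e.g. a ViT's 2D pos. embedding
print(project_points_to_views(pts, pe).shape)          # torch.Size([2, 1024, 64])
```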
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- Monocular Scene Reconstruction with 3D SDF Transformers [17.565474518578178]
We propose an SDF transformer network, which replaces the role of 3D CNN for better 3D feature aggregation.
Experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction.
arXiv Detail & Related papers (2023-01-31T09:54:20Z)
- Memory transformers for full context and high-resolution 3D Medical Segmentation [76.93387214103863]
This paper introduces the Full resolutIoN mEmory (FINE) transformer to overcome the memory constraints of full-resolution 3D segmentation.
The core idea behind FINE is to learn memory tokens to indirectly model full range interactions.
Experiments on the BCV image segmentation dataset show better performance than state-of-the-art CNN and transformer baselines.
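A minimal sketch of the memory-token idea, under the assumption that a few learnable tokens are attended jointly with each local window's tokens so that distant regions can interact indirectly. This is illustrative only; FINE additionally shares and propagates its memory tokens across windows and volumes, which is not reproduced here.

```python
# Minimal sketch of learned memory tokens (inspired by the FINE idea, not the
# paper's implementation): a few learnable tokens are attended jointly with
# each local window's tokens, so windows can exchange information indirectly.
import torch
import torch.nn as nn

class MemoryTokenAttention(nn.Module):
    def __init__(self, dim, n_memory=4, n_heads=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, n_memory, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, windows):
        # windows: (B * n_windows, N, C) tokens of each local 3D window
        mem = self.memory.expand(windows.shape[0], -1, -1)
        x = torch.cat([mem, windows], dim=1)
        x, _ = self.attn(x, x, x)
        return x[:, self.memory.shape[1]:]   # return updated window tokens

blk = MemoryTokenAttention(dim=32)
print(blk(torch.randn(8, 64, 32)).shape)     # torch.Size([8, 64, 32])
```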
arXiv Detail & Related papers (2022-10-11T10:11:05Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- PatchFormer: A Versatile 3D Transformer Based on Patch Attention [0.358439716487063]
We introduce patch-attention to adaptively learn a much smaller set of bases upon which the attention maps are computed.
By a weighted summation over these bases, patch-attention not only captures the global shape context but also achieves linear complexity with respect to input size.
Our network achieves strong accuracy on general 3D recognition tasks with a 7.3x speed-up over previous 3D Transformers.
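A rough sketch of a patch-attention-style mechanism as summarized above: N tokens are condensed into M << N bases by a weighted summation, and attention is then computed against those bases only, giving cost linear in N. Names such as BasisAttention and n_bases are illustrative; PatchFormer's actual basis construction differs in detail.

```python
# Rough sketch of a patch-attention style mechanism: N input tokens are first
# summarized into M << N bases by a weighted summation, and attention is then
# computed against those bases only, giving O(N*M) instead of O(N^2) cost.
# (Illustrative only; PatchFormer's actual basis construction differs.)
import torch
import torch.nn as nn

class BasisAttention(nn.Module):
    def __init__(self, dim, n_bases=32):
        super().__init__()
        self.to_weights = nn.Linear(dim, n_bases)   # how much each token contributes
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                            # x: (B, N, C)
        w = self.to_weights(x).softmax(dim=1)        # (B, N, M), sums to 1 over tokens
        bases = torch.einsum('bnm,bnc->bmc', w, x)   # (B, M, C) weighted summation
        q, k, v = self.q(x), self.k(bases), self.v(bases)
        attn = (q @ k.transpose(1, 2)) / x.shape[-1] ** 0.5   # (B, N, M)
        return attn.softmax(dim=-1) @ v              # (B, N, C), linear in N

layer = BasisAttention(dim=64)
print(layer(torch.randn(2, 4096, 64)).shape)         # torch.Size([2, 4096, 64])
```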
arXiv Detail & Related papers (2021-10-30T08:39:55Z)
- AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation [19.53151547706724]
Recent transformer-based models have drawn attention to exploring these architectures in medical image segmentation.
We propose Axial Fusion Transformer UNet (AFTer-UNet), which combines the strength of convolutional layers at extracting detailed features with the strength of transformers at long-sequence modeling.
It has fewer parameters and takes less GPU memory to train than the previous transformer-based models.
arXiv Detail & Related papers (2021-10-20T06:47:28Z)
- CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)
- Pix2Vox++: Multi-scale Context-aware 3D Object Reconstruction from Single and Multiple Images [56.652027072552606]
We propose a novel framework for single-view and multi-view 3D object reconstruction, named Pix2Vox++.
By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image.
A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume.
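A hedged sketch of such context-aware fusion, assuming a per-voxel score is predicted for each coarse volume, normalized across views, and used for a weighted sum. The scoring network here is a toy stand-in; Pix2Vox++'s actual module is deeper and conditioned on context features.

```python
# Hedged sketch of context-aware fusion: each coarse volume gets a per-voxel
# score from a small scoring network, scores are normalized across views, and
# the fused volume is the score-weighted sum (names/shapes are illustrative).
import torch
import torch.nn as nn

class ContextAwareFusion(nn.Module):
    def __init__(self, feat_ch=9):
        super().__init__()
        # toy scoring branch; the real Pix2Vox++ module is deeper
        self.score = nn.Sequential(
            nn.Conv3d(1, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_ch, 1, 3, padding=1))

    def forward(self, coarse_volumes):               # (B, V, D, H, W), V views
        B, V, D, H, W = coarse_volumes.shape
        vols = coarse_volumes.reshape(B * V, 1, D, H, W)
        scores = self.score(vols).reshape(B, V, D, H, W)
        weights = scores.softmax(dim=1)              # per-voxel weights over views
        return (weights * coarse_volumes).sum(dim=1) # fused (B, D, H, W) volume

fusion = ContextAwareFusion()
print(fusion(torch.rand(2, 5, 32, 32, 32)).shape)    # torch.Size([2, 32, 32, 32])
```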
arXiv Detail & Related papers (2020-06-22T13:48:09Z)
- 3D Crowd Counting via Geometric Attention-guided Multi-View Fusion [50.520192402702015]
We propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps.
Compared to 2D fusion, 3D fusion extracts more information about the people along the z-dimension (height), which helps to address scale variations across views.
The 3D density maps preserve the property of 2D density maps that the sum is the count, while also providing 3D information about the crowd density.
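The sum-is-the-count property can be checked numerically with a toy 3D density map built from unit-mass Gaussian blobs, one per person (illustrative only, not the paper's code):

```python
# Quick numerical illustration of the "sum is the count" property: a density
# map built from unit-mass Gaussian blobs (one per person) integrates to the
# number of people, in 3D exactly as in 2D. (Toy example.)
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map_3d(points, shape=(32, 64, 64), sigma=2.0):
    """points: integer (z, y, x) person locations inside `shape`."""
    dmap = np.zeros(shape, dtype=np.float64)
    for p in points:
        dmap[tuple(p)] += 1.0                 # unit mass per person
    return gaussian_filter(dmap, sigma)       # normalized smoothing preserves mass

people = np.array([[5, 10, 12], [20, 40, 33], [15, 22, 50]])
dmap = density_map_3d(people)
print(dmap.sum())                             # ~3.0 == number of people
```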
arXiv Detail & Related papers (2020-03-18T11:35:11Z)