Ray-Patch: An Efficient Querying for Light Field Transformers
- URL: http://arxiv.org/abs/2305.09566v2
- Date: Thu, 17 Aug 2023 09:39:05 GMT
- Title: Ray-Patch: An Efficient Querying for Light Field Transformers
- Authors: T. Berriel Martins and Javier Civera
- Abstract summary: We propose the Ray-Patch querying, a novel model to efficiently query transformers to decode implicit representations into target views.
Our Ray-Patch decoding reduces the computational footprint and increases inference speed up to one order of magnitude compared to previous models.
- Score: 10.859910783551937
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper we propose the Ray-Patch querying, a novel model to efficiently
query transformers to decode implicit representations into target views. Our
Ray-Patch decoding reduces the computational footprint and increases inference
speed up to one order of magnitude compared to previous models, without losing
global attention, and hence maintaining specific task metrics. The key idea of
our novel querying is to split the target image into a set of patches, then
query the transformer for each patch to extract a set of feature vectors,
which are finally decoded into the target image using convolutional layers. Our
experimental results, implementing Ray-Patch in 3 different architectures and
evaluating it on 2 different tasks and datasets, demonstrate and quantify the
effectiveness of our method, specifically a notable boost in rendering speed
for the same task metrics.
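The mechanism described above (one query per target patch instead of one per pixel, followed by a convolutional decoder) can be illustrated with a short PyTorch sketch. The module name, the way patch queries are built, and the two-stage upsampler are assumptions made for illustration, not the authors' reference implementation:

```python
import torch
import torch.nn as nn


class RayPatchStyleDecoder(nn.Module):
    """One query per target patch -> cross-attention into the scene latent ->
    convolutional upsampling of the patch-feature map into an RGB image."""

    def __init__(self, dim: int = 256, patch_size: int = 8, num_heads: int = 8):
        super().__init__()
        assert patch_size == 8, "upsampling stages below are laid out for 8x8 patches"
        # Patch queries attend to the latent set produced by the transformer encoder.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Upsample the (Hp x Wp) patch-feature map by a factor of patch_size.
        self.to_pixels = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=4, stride=4),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, 3, kernel_size=2, stride=2),
        )

    def forward(self, patch_queries, scene_latent, grid_hw):
        # patch_queries: (B, Hp*Wp, dim), e.g. built from the rays of each target patch.
        # scene_latent:  (B, N, dim), the transformer's implicit scene representation.
        # grid_hw:       (Hp, Wp), the patch grid of the target view.
        feats, _ = self.cross_attn(patch_queries, scene_latent, scene_latent)
        B, P, D = feats.shape
        Hp, Wp = grid_hw
        fmap = feats.transpose(1, 2).reshape(B, D, Hp, Wp)  # patch-feature map
        return self.to_pixels(fmap)                          # (B, 3, Hp*8, Wp*8)


# Usage: a 128x128 target view split into 8x8 patches needs only 16*16 = 256 queries.
decoder = RayPatchStyleDecoder()
img = decoder(torch.randn(1, 16 * 16, 256), torch.randn(1, 1024, 256), (16, 16))
print(img.shape)  # torch.Size([1, 3, 128, 128])
```

With 8x8 patches, only Hp*Wp queries pass through the attention layers instead of H*W per-pixel (per-ray) queries, which is where the reduction in computational footprint claimed in the abstract comes from.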
Related papers
- Bridging Vision and Language Encoders: Parameter-Efficient Tuning for
Referring Image Segmentation [72.27914940012423]
We investigate efficient tuning for referring image segmentation.
We propose a novel adapter called Bridger to facilitate cross-modal information exchange.
We also design a lightweight decoder for image segmentation.
arXiv Detail & Related papers (2023-07-21T12:46:15Z) - Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
Applying the same design to the recent transformer-based image recognition model ViT shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem for visual transformers by excavating redundant calculation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with
Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, independently mapping text and vision into a joint embedding space (dual encoders) is attractive because retrieval scales well; a minimal scoring sketch for this setup follows after the list.
An alternative approach, vision-text transformers with cross-attention, gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z) - CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image
Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, instead of the otherwise quadratic cost; a minimal sketch of this single-query cross-attention follows after the list.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
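For the CrossViT entry above, the linear-cost exchange it refers to can be sketched as cross-attention with a single CLS query. Both branches are assumed here to share one embedding dimension (the paper projects between branch dimensions), so this is an illustration rather than the authors' implementation:

```python
import torch
import torch.nn as nn


class ClsCrossAttention(nn.Module):
    """CLS token of one branch attends to the patch tokens of the other branch.
    With a single query token, cost grows linearly with the other branch's length."""

    def __init__(self, dim: int = 192, num_heads: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # cls_a:    (B, 1, dim)  CLS token from, e.g., the small-patch branch
        # tokens_b: (B, N, dim)  patch tokens from the large-patch branch
        kv = torch.cat([cls_a, tokens_b], dim=1)  # (B, N+1, dim)
        fused, _ = self.attn(cls_a, kv, kv)       # single-query attention
        return cls_a + fused                      # residual update of the CLS token


# Usage: the updated CLS token would then be placed back into its own branch.
mixer = ClsCrossAttention()
new_cls = mixer(torch.randn(2, 1, 192), torch.randn(2, 196, 192))
print(new_cls.shape)  # torch.Size([2, 1, 192])
```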
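For the "Thinking Fast and Slow" retrieval entry above, dual encoders scale because gallery embeddings can be precomputed offline and scoring reduces to dot products. A minimal, hypothetical sketch (the text and vision encoders themselves are omitted):

```python
import torch
import torch.nn.functional as F


def score_gallery(query_emb: torch.Tensor, gallery_embs: torch.Tensor) -> torch.Tensor:
    """Dual-encoder retrieval: gallery embeddings are computed once offline, so a
    text query only costs one cosine similarity per gallery item."""
    query_emb = F.normalize(query_emb, dim=-1)        # (D,)   text embedding
    gallery_embs = F.normalize(gallery_embs, dim=-1)  # (N, D) image/video embeddings
    return gallery_embs @ query_emb                   # (N,)   similarities


# Usage: rank a (hypothetical) gallery of 10k precomputed visual embeddings.
scores = score_gallery(torch.randn(512), torch.randn(10_000, 512))
print(scores.topk(5).indices)  # indices of the five best matches
```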