FeatSharp: Your Vision Model Features, Sharper
- URL: http://arxiv.org/abs/2502.16025v1
- Date: Sat, 22 Feb 2025 00:54:49 GMT
- Title: FeatSharp: Your Vision Model Features, Sharper
- Authors: Mike Ranzinger, Greg Heinrich, Pavlo Molchanov, Jan Kautz, Bryan Catanzaro, Andrew Tao
- Abstract summary: We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training.
- Score: 64.25786703202414
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general purpose vision backbones are Vision Transformers (ViT), typically trained using contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at 224x224px, while the "high resolution" versions are around 378-448px, but still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training as a way of providing richer targets for distillation.
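The abstract does not spell out the mechanism, but one common way to recover fine detail from a fixed-resolution encoder is to run it on tiles of a higher-resolution image and fuse the result with the upsampled global features. The sketch below shows that generic recipe only; the 2x input resolution, 2x2 tiling, stand-in encoder, and simple averaging fusion are illustrative assumptions, not the FeatSharp upsampler itself, which learns the fusion.

```python
import torch
import torch.nn.functional as F

def upsample_features_with_tiles(encoder, image, base_res=224, patch=16):
    """Fuse bilinearly upsampled global features with features computed on
    2x2 tiles of a 2x-resolution image.

    `encoder` is assumed to map a (B, 3, base_res, base_res) image to
    (B, C, base_res // patch, base_res // patch) spatial features.
    """
    B, _, H, W = image.shape
    assert H == 2 * base_res and W == 2 * base_res, "expects a 2x-resolution input"
    g = base_res // patch  # low-res feature grid size

    # Global pass: downsample the image, encode, then upsample the features 2x.
    global_feats = encoder(F.interpolate(image, size=(base_res, base_res),
                                         mode="bilinear", align_corners=False))
    global_up = F.interpolate(global_feats, scale_factor=2,
                              mode="bilinear", align_corners=False)

    # Tiled pass: encode each of the four base_res x base_res tiles at full detail.
    tiled = torch.zeros_like(global_up)
    for i in range(2):
        for j in range(2):
            tile = image[:, :, i * base_res:(i + 1) * base_res,
                               j * base_res:(j + 1) * base_res]
            tiled[:, :, i * g:(i + 1) * g, j * g:(j + 1) * g] = encoder(tile)

    # Placeholder fusion: average the two views (FeatSharp learns this step).
    return 0.5 * (global_up + tiled)

# Example with a stand-in encoder (patch-16 ViT at 224 px -> 768 x 14 x 14 features):
dummy_encoder = lambda x: torch.randn(x.shape[0], 768, 14, 14)
print(upsample_features_with_tiles(dummy_encoder, torch.randn(1, 3, 448, 448)).shape)
# torch.Size([1, 768, 28, 28])
```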
Related papers
- HyperCLIP: Adapting Vision-Language models with Hypernetworks [43.23792024551352]
We propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork.
All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end.
HyperCLIP increases the zero-shot accuracy of SigLIP-trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100, with minimal training throughput overhead.
arXiv Detail & Related papers (2024-12-21T21:19:08Z)
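To make the hypernetwork idea in the HyperCLIP summary above concrete, here is a generic sketch of a layer whose weights are emitted by a small hypernetwork. The conditioning vector, the choice of target layer, and all sizes are invented for the example; the summary does not specify HyperCLIP's actual design.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A linear layer whose weight and bias are produced by a hypernetwork.

    The conditioning vector `z` is an illustrative choice; HyperCLIP's actual
    conditioning and target layers are not reproduced here.
    """
    def __init__(self, in_dim, out_dim, z_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.hyper = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim * in_dim + out_dim),
        )

    def forward(self, x, z):
        params = self.hyper(z)                                # (out*in + out,)
        w = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = params[self.out_dim * self.in_dim:]
        return x @ w.t() + b

# Usage: generate a projection head for a small image encoder from a 32-d vector.
layer = HyperLinear(in_dim=384, out_dim=512, z_dim=32)
feats = torch.randn(8, 384)   # features from a small image encoder
z = torch.randn(32)           # conditioning vector (illustrative)
print(layer(feats, z).shape)  # torch.Size([8, 512])
```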
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Fill in the blanks: Rethinking Interpretability in vision [0.0]
We re-think vision-model explainability from a novel perspective, to probe the general input structure that a model has learnt during its training.
Experiments on standard vision datasets and pre-trained models reveal consistent patterns, which could be integrated as an additional model-agnostic explainability tool.
arXiv Detail & Related papers (2024-11-15T15:31:06Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Addressing a fundamental limitation in deep vision models: lack of spatial attention [43.37813040320147]
The aim of this manuscript is to underscore a significant limitation in current deep learning models, particularly vision models.
Unlike human vision, which efficiently selects only the essential visual areas for further processing, deep vision models process the entire image.
We propose two solutions that could pave the way for the next generation of more efficient vision models.
arXiv Detail & Related papers (2024-07-01T20:21:09Z)
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
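As a rough illustration of the encoder-free setup described for EVE above (image patches embedded directly and processed jointly with text tokens by a single decoder), here is a minimal sketch; the patch size, model width, and plain causal transformer backbone are assumptions of the example, not EVE's architecture.

```python
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    """Sketch of an encoder-free VLM: raw image patches are linearly embedded
    and prepended to text token embeddings, then processed by one causal
    transformer. All sizes are illustrative, not EVE's.
    """
    def __init__(self, vocab=32000, dim=512, patch=16, layers=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.tok_embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, images, token_ids):
        img_tok = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        txt_tok = self.tok_embed(token_ids)                            # (B, T, D)
        seq = torch.cat([img_tok, txt_tok], dim=1)
        # Causal mask over the whole sequence (a simplification for the sketch).
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        out = self.backbone(seq, mask=mask)
        return self.lm_head(out[:, img_tok.size(1):])  # logits for text positions

model = EncoderFreeVLM()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 32000])
```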
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
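The "range view" in the RangeFormer summary above is the spherical projection of a LiDAR sweep onto a 2D image, which is also where the "many-to-one" mapping problem arises. Below is a sketch of that standard projection (not RangeFormer's architecture); the 64x2048 resolution and vertical field of view are typical sensor values assumed for the example.

```python
import numpy as np

def lidar_to_range_image(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an H x W range image.

    Each pixel stores the range of the point that maps to it (0 = empty).
    The vertical field of view (degrees) is a typical value, assumed here.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8

    yaw = np.arctan2(y, x)        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)      # elevation angle

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up_r - fov_down_r

    # Normalize angles to [0, 1] image coordinates.
    u = 0.5 * (1.0 - yaw / np.pi)           # horizontal
    v = (fov_up_r - pitch) / fov            # vertical

    cols = np.clip((u * W).astype(np.int32), 0, W - 1)
    rows = np.clip((v * H).astype(np.int32), 0, H - 1)

    range_img = np.zeros((H, W), dtype=np.float32)
    # "Many-to-one": several points can land in one pixel; keep the closest
    # by writing far points first so near points overwrite them.
    order = np.argsort(-r)
    range_img[rows[order], cols[order]] = r[order]
    return range_img

range_img = lidar_to_range_image(np.random.randn(100000, 3) * 20)
print(range_img.shape)  # (64, 2048)
```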
- Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation [33.018300966769516]
Most State of the Art (SOTA) works in the self-supervised and unsupervised domain aim to predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z)
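The fusion step described in the depth-estimation summary above (local convolutional features combined with global transformer features under a predicted mask) can be illustrated with a generic gated-fusion module; the sketch below is that generic pattern with invented sizes, not the paper's multi-stream design.

```python
import torch
import torch.nn as nn

class MaskGuidedFusion(nn.Module):
    """Illustrative fusion of a local (CNN) and a global (transformer) feature
    map of the same spatial size: a predicted mask gates the two streams.
    """
    def __init__(self, dim):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1), nn.Sigmoid(),
        )
        self.out = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, local_feats, global_feats):
        # Per-pixel mask in [0, 1] decides how much each stream contributes.
        m = self.mask_head(torch.cat([local_feats, global_feats], dim=1))
        fused = m * local_feats + (1.0 - m) * global_feats
        return self.out(fused)

fuse = MaskGuidedFusion(dim=256)
out = fuse(torch.randn(2, 256, 48, 160), torch.randn(2, 256, 48, 160))
print(out.shape)  # torch.Size([2, 256, 48, 160])
```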
- Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition [185.80889967154963]
We present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition.
By realizing the importance of the positional information carried by 2D feature representations, Vision Permutator encodes the feature representations along the height and width dimensions with linear projections.
We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers.
arXiv Detail & Related papers (2021-06-23T13:05:23Z)
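The height/width mixing with linear projections mentioned in the Vision Permutator summary above is the core of its Permute-MLP block. The sketch below is a simplified version (one projection per axis, no channel-segment splitting) meant only to make the operation concrete, not the paper's exact block.

```python
import torch
import torch.nn as nn

class SimplePermuteMLP(nn.Module):
    """Simplified Permute-MLP: mix information along height, width, and
    channels with separate linear projections, then sum the three branches.
    The real Vision Permutator splits channels into segments before mixing;
    this sketch skips that for clarity.
    """
    def __init__(self, dim, height, width):
        super().__init__()
        self.proj_c = nn.Linear(dim, dim)
        self.proj_h = nn.Linear(height, height)
        self.proj_w = nn.Linear(width, width)

    def forward(self, x):  # x: (B, H, W, C)
        c = self.proj_c(x)
        h = self.proj_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # mix along H
        w = self.proj_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # mix along W
        return c + h + w

block = SimplePermuteMLP(dim=384, height=14, width=14)
print(block(torch.randn(2, 14, 14, 384)).shape)  # torch.Size([2, 14, 14, 384])
```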
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)