FeatSharp: Your Vision Model Features, Sharper
- URL: http://arxiv.org/abs/2502.16025v1
- Date: Sat, 22 Feb 2025 00:54:49 GMT
- Title: FeatSharp: Your Vision Model Features, Sharper
- Authors: Mike Ranzinger, Greg Heinrich, Pavlo Molchanov, Jan Kautz, Bryan Catanzaro, Andrew Tao
- Abstract summary: We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training.
- Score: 64.25786703202414
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general purpose vision backbones are Vision Transformers (ViT), typically trained using contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at 224x224px, while the "high resolution" versions are around 378-448px, but still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training as a way of providing richer targets for distillation.
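The abstract does not spell out the mechanism, but one common way to recover fine detail from a fixed-resolution encoder is to run it on tiles of a higher-resolution image and fuse the result with the upsampled global features. The sketch below shows that generic recipe only; the 2x input resolution, 2x2 tiling, stand-in encoder, and simple averaging fusion are illustrative assumptions, not the FeatSharp upsampler itself, which learns the fusion.

```python
import torch
import torch.nn.functional as F

def upsample_features_with_tiles(encoder, image, base_res=224, patch=16):
    """Fuse bilinearly upsampled global features with features computed on
    2x2 tiles of a 2x-resolution image.

    `encoder` is assumed to map a (B, 3, base_res, base_res) image to
    (B, C, base_res // patch, base_res // patch) spatial features.
    """
    B, _, H, W = image.shape
    assert H == 2 * base_res and W == 2 * base_res, "expects a 2x-resolution input"
    g = base_res // patch  # low-res feature grid size

    # Global pass: downsample the image, encode, then upsample the features 2x.
    global_feats = encoder(F.interpolate(image, size=(base_res, base_res),
                                         mode="bilinear", align_corners=False))
    global_up = F.interpolate(global_feats, scale_factor=2,
                              mode="bilinear", align_corners=False)

    # Tiled pass: encode each of the four base_res x base_res tiles at full detail.
    tiled = torch.zeros_like(global_up)
    for i in range(2):
        for j in range(2):
            tile = image[:, :, i * base_res:(i + 1) * base_res,
                               j * base_res:(j + 1) * base_res]
            tiled[:, :, i * g:(i + 1) * g, j * g:(j + 1) * g] = encoder(tile)

    # Placeholder fusion: average the two views (FeatSharp learns this step).
    return 0.5 * (global_up + tiled)

# Example with a stand-in encoder (patch-16 ViT at 224 px -> 768 x 14 x 14 features):
dummy_encoder = lambda x: torch.randn(x.shape[0], 768, 14, 14)
print(upsample_features_with_tiles(dummy_encoder, torch.randn(1, 3, 448, 448)).shape)
# torch.Size([1, 768, 28, 28])
```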
Related papers
- HyperCLIP: Adapting Vision-Language models with Hypernetworks [43.23792024551352]
We propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork.
All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end.
HyperCLIP increases the zero-shot accuracy of SigLIP-trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100, with minimal training throughput overhead.
arXiv Detail & Related papers (2024-12-21T21:19:08Z)
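To make the hypernetwork idea in the HyperCLIP summary above concrete, here is a generic sketch of a layer whose weights are emitted by a small hypernetwork. The conditioning vector, the choice of target layer, and all sizes are invented for the example; the summary does not specify HyperCLIP's actual design.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A linear layer whose weight and bias are produced by a hypernetwork.

    The conditioning vector `z` is an illustrative choice; HyperCLIP's actual
    conditioning and target layers are not reproduced here.
    """
    def __init__(self, in_dim, out_dim, z_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.hyper = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim * in_dim + out_dim),
        )

    def forward(self, x, z):
        params = self.hyper(z)                                # (out*in + out,)
        w = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = params[self.out_dim * self.in_dim:]
        return x @ w.t() + b

# Usage: generate a projection head for a small image encoder from a 32-d vector.
layer = HyperLinear(in_dim=384, out_dim=512, z_dim=32)
feats = torch.randn(8, 384)   # features from a small image encoder
z = torch.randn(32)           # conditioning vector (illustrative)
print(layer(feats, z).shape)  # torch.Size([8, 512])
```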
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- Fill in the blanks: Rethinking Interpretability in vision [0.0]
We re-think vision-model explainability from a novel perspective, to probe the general input structure that a model has learnt during its training.
Experiments on standard vision datasets and pre-trained models reveal consistent patterns, which could be integrated as an additional model-agnostic explainability tool.
arXiv Detail & Related papers (2024-11-15T15:31:06Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Addressing a fundamental limitation in deep vision models: lack of spatial attention [43.37813040320147]
The aim of this manuscript is to underscore a significant limitation in current deep learning models, particularly vision models.
Unlike human vision, which efficiently selects only the essential visual areas for further processing, deep vision models process the entire image.
We propose two solutions that could pave the way for the next generation of more efficient vision models.
arXiv Detail & Related papers (2024-07-01T20:21:09Z)
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
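As a rough illustration of the encoder-free setup described for EVE above (image patches embedded directly and processed jointly with text tokens by a single decoder), here is a minimal sketch; the patch size, model width, and plain causal transformer backbone are assumptions of the example, not EVE's architecture.

```python
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    """Sketch of an encoder-free VLM: raw image patches are linearly embedded
    and prepended to text token embeddings, then processed by one causal
    transformer. All sizes are illustrative, not EVE's.
    """
    def __init__(self, vocab=32000, dim=512, patch=16, layers=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.tok_embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, images, token_ids):
        img_tok = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        txt_tok = self.tok_embed(token_ids)                            # (B, T, D)
        seq = torch.cat([img_tok, txt_tok], dim=1)
        # Causal mask over the whole sequence (a simplification for the sketch).
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        out = self.backbone(seq, mask=mask)
        return self.lm_head(out[:, img_tok.size(1):])  # logits for text positions

model = EncoderFreeVLM()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 32000])
```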
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
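The "range view" in the RangeFormer summary above is the spherical projection of a LiDAR sweep onto a 2D image, which is also where the "many-to-one" mapping problem arises. Below is a sketch of that standard projection (not RangeFormer's architecture); the 64x2048 resolution and vertical field of view are typical sensor values assumed for the example.

```python
import numpy as np

def lidar_to_range_image(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an H x W range image.

    Each pixel stores the range of the point that maps to it (0 = empty).
    The vertical field of view (degrees) is a typical value, assumed here.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8

    yaw = np.arctan2(y, x)        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)      # elevation angle

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up_r - fov_down_r

    # Normalize angles to [0, 1] image coordinates.
    u = 0.5 * (1.0 - yaw / np.pi)           # horizontal
    v = (fov_up_r - pitch) / fov            # vertical

    cols = np.clip((u * W).astype(np.int32), 0, W - 1)
    rows = np.clip((v * H).astype(np.int32), 0, H - 1)

    range_img = np.zeros((H, W), dtype=np.float32)
    # "Many-to-one": several points can land in one pixel; keep the closest
    # by writing far points first so near points overwrite them.
    order = np.argsort(-r)
    range_img[rows[order], cols[order]] = r[order]
    return range_img

range_img = lidar_to_range_image(np.random.randn(100000, 3) * 20)
print(range_img.shape)  # (64, 2048)
```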
- Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation [33.018300966769516]
Most State of the Art (SOTA) works in the self-supervised and unsupervised domain aim to predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z)
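The fusion step described in the depth-estimation summary above (local convolutional features combined with global transformer features under a predicted mask) can be illustrated with a generic gated-fusion module; the sketch below is that generic pattern with invented sizes, not the paper's multi-stream design.

```python
import torch
import torch.nn as nn

class MaskGuidedFusion(nn.Module):
    """Illustrative fusion of a local (CNN) and a global (transformer) feature
    map of the same spatial size: a predicted mask gates the two streams.
    """
    def __init__(self, dim):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1), nn.Sigmoid(),
        )
        self.out = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, local_feats, global_feats):
        # Per-pixel mask in [0, 1] decides how much each stream contributes.
        m = self.mask_head(torch.cat([local_feats, global_feats], dim=1))
        fused = m * local_feats + (1.0 - m) * global_feats
        return self.out(fused)

fuse = MaskGuidedFusion(dim=256)
out = fuse(torch.randn(2, 256, 48, 160), torch.randn(2, 256, 48, 160))
print(out.shape)  # torch.Size([2, 256, 48, 160])
```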
- Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition [185.80889967154963]
We present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition.
By realizing the importance of the positional information carried by 2D feature representations, Vision Permutator encodes the feature representations along the height and width dimensions with linear projections.
We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers.
arXiv Detail & Related papers (2021-06-23T13:05:23Z)
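The height/width mixing with linear projections mentioned in the Vision Permutator summary above is the core of its Permute-MLP block. The sketch below is a simplified version (one projection per axis, no channel-segment splitting) meant only to make the operation concrete, not the paper's exact block.

```python
import torch
import torch.nn as nn

class SimplePermuteMLP(nn.Module):
    """Simplified Permute-MLP: mix information along height, width, and
    channels with separate linear projections, then sum the three branches.
    The real Vision Permutator splits channels into segments before mixing;
    this sketch skips that for clarity.
    """
    def __init__(self, dim, height, width):
        super().__init__()
        self.proj_c = nn.Linear(dim, dim)
        self.proj_h = nn.Linear(height, height)
        self.proj_w = nn.Linear(width, width)

    def forward(self, x):  # x: (B, H, W, C)
        c = self.proj_c(x)
        h = self.proj_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # mix along H
        w = self.proj_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # mix along W
        return c + h + w

block = SimplePermuteMLP(dim=384, height=14, width=14)
print(block(torch.randn(2, 14, 14, 384)).shape)  # torch.Size([2, 14, 14, 384])
```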
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)