LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
- URL: http://arxiv.org/abs/2504.14032v1
- Date: Fri, 18 Apr 2025 18:46:08 GMT
- Title: LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
- Authors: Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang
- Abstract summary: Feature upsampling offers a promising direction to address this challenge. We introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions.
- Score: 27.379438040350188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
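The architecture and the training objective described in the abstract both lend themselves to compact sketches. First, a minimal PyTorch illustration of a coordinate-based cross-attention upsampler: high-resolution pixel coordinates (plus the RGB values at those pixels) form the queries, and the low-resolution VFM features form the keys and values. Everything here, from the module name to the sinusoidal coordinate embedding, is an illustrative assumption rather than the authors' released implementation.

```python
# Minimal sketch of a coordinate-based cross-attention upsampler.
# Queries come from HR pixel coordinates (+ RGB); keys/values come from
# the low-resolution VFM features. Names and sizes are illustrative.
import math
import torch
import torch.nn as nn


class CoordCrossAttnUpsampler(nn.Module):
    def __init__(self, feat_dim=768, img_dim=3, hidden=256, num_heads=8):
        super().__init__()
        self.query_proj = nn.Linear(4 + img_dim, hidden)  # sin/cos of (x, y) + RGB
        self.key_proj = nn.Linear(feat_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.out_proj = nn.Linear(hidden, feat_dim)

    def forward(self, image, lr_feats):
        # image:    (B, 3, H, W) high-resolution input
        # lr_feats: (B, C, h, w) low-resolution VFM features
        B, _, H, W = image.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=image.device),
            torch.linspace(-1, 1, W, device=image.device),
            indexing="ij",
        )
        coords = torch.stack([xs, ys], dim=-1)                   # (H, W, 2)
        emb = torch.cat([torch.sin(math.pi * coords),
                         torch.cos(math.pi * coords)], dim=-1)   # (H, W, 4)
        emb = emb.flatten(0, 1).expand(B, -1, -1)                # (B, HW, 4)
        rgb = image.flatten(2).transpose(1, 2)                   # (B, HW, 3)
        q = self.query_proj(torch.cat([emb, rgb], dim=-1))       # (B, HW, hidden)
        kv = self.key_proj(lr_feats.flatten(2).transpose(1, 2))  # (B, hw, hidden)
        out, _ = self.attn(q, kv, kv)  # chunk q in practice: HW can be large
        return self.out_proj(out).transpose(1, 2).reshape(B, -1, H, W)
```

Because the queries are functions of continuous pixel coordinates, the same module can decode at arbitrary output resolutions, which is consistent with the abstract's claim of flexibility across input and feature resolutions. Second, the training objective: a high-resolution pseudo-groundtruth can be built by averaging features inside each class-agnostic mask. The mask source and the mean-pooling choice below are assumptions for illustration.

```python
# Sketch of building a pseudo-groundtruth by pooling features within
# class-agnostic masks (mask source and pooling are assumptions here).
import torch


def masked_pseudo_gt(hr_feats, masks, eps=1e-6):
    # hr_feats: (B, C, H, W) upsampled features from a teacher pass
    # masks:    (B, M, H, W) binary, ideally non-overlapping masks
    B, C, H, W = hr_feats.shape
    m = masks.flatten(2).float()                        # (B, M, HW)
    f = hr_feats.flatten(2).transpose(1, 2)             # (B, HW, C)
    pooled = (m @ f) / (m.sum(-1, keepdim=True) + eps)  # per-mask mean (B, M, C)
    pseudo = m.transpose(1, 2) @ pooled                 # scatter back (B, HW, C)
    return pseudo.transpose(1, 2).reshape(B, C, H, W)
```

Regressing the upsampler output against such a target, with the teacher pass updated by self-distillation, is the spirit of the proposed objective.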
Related papers
- DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion [58.36400052566673]
Infrared and visible image fusion integrates information from distinct spectral bands to enhance image quality.
Existing approaches treat image fusion and subsequent high-level tasks as separate processes.
We propose a Discriminative Cross-Dimensional Evolutionary Learning Framework, termed DCEvo, which simultaneously enhances visual quality and perception accuracy.
arXiv Detail & Related papers (2025-03-22T07:01:58Z)
- Efficient Feature Fusion for UAV Object Detection [9.632727117779178]
Small objects, in particular, occupy small portions of images, making their accurate detection difficult. Existing multi-scale feature fusion methods address these challenges by aggregating features across different resolutions. We propose a novel feature fusion framework specifically designed for UAV object detection tasks.
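As a generic illustration of aggregating features across resolutions (not this paper's specific framework), a top-down, FPN-style fusion looks like the following; the class name and channel sizes are assumptions.

```python
# Generic top-down multi-scale fusion (FPN-style); illustrative only.
import torch.nn as nn
import torch.nn.functional as F


class TopDownFusion(nn.Module):
    def __init__(self, in_dims=(256, 512, 1024), out_dim=256):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in in_dims])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_dim, out_dim, 3, padding=1) for _ in in_dims]
        )

    def forward(self, feats):  # feats: list ordered fine -> coarse
        x = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(x) - 2, -1, -1):  # propagate coarse context down
            x[i] = x[i] + F.interpolate(x[i + 1], size=x[i].shape[-2:], mode="nearest")
        return [s(f) for s, f in zip(self.smooth, x)]
```

Fusing coarse context into fine-resolution maps is what keeps small objects detectable without discarding spatial detail.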
arXiv Detail & Related papers (2025-01-29T20:39:16Z)
- A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
A popular similarity-based feature upsampling pipeline has been proposed, which utilizes a high-resolution feature as guidance. We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives. We develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
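The general pipeline can be sketched as cross-resolution attention: high-resolution guidance features act as queries, their low-resolution counterparts as keys, and the features to be upsampled as values. This is a hedged sketch of the generic similarity-based pipeline, not ReSFU's exact alignment or neighbor-selection scheme.

```python
# Generic similarity-based feature upsampling: mix LR features with
# weights given by HR-query / LR-key similarity. Illustrative sketch.
import torch


def similarity_upsample(hr_guide, lr_guide, lr_feats):
    # hr_guide: (B, C, H, W) high-resolution guidance (queries)
    # lr_guide: (B, C, h, w) guidance at low resolution (keys)
    # lr_feats: (B, D, h, w) features to upsample (values)
    B, C, H, W = hr_guide.shape
    q = hr_guide.flatten(2).transpose(1, 2)         # (B, HW, C)
    k = lr_guide.flatten(2)                         # (B, C, hw)
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)  # (B, HW, hw)
    v = lr_feats.flatten(2).transpose(1, 2)         # (B, hw, D)
    out = attn @ v                                  # (B, HW, D)
    return out.transpose(1, 2).reshape(B, -1, H, W)
```

In practice the attention is usually restricted to a local neighborhood around each query, both for cost and, as the entry notes, to suppress mosaic artifacts.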
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
- Wider and Higher: Intensive Integration and Global Foreground Perception for Image Matting [44.51635913732913]
This paper reviews recent deep-learning-based matting research and motivates our "wider and higher" perspective on image matting.
Image matting is essentially a pixel-wise regression, and the ideal situation is to perceive the maximum opacity from the input image.
We propose an Intensive Integration and Global Foreground Perception network (I2GFP) to integrate wider and higher feature streams.
arXiv Detail & Related papers (2022-10-13T11:34:46Z)
- BIMS-PU: Bi-Directional and Multi-Scale Point Cloud Upsampling [60.257912103351394]
We develop a new point cloud upsampling pipeline called BIMS-PU.
We decompose the up/downsampling procedure into several up/downsampling sub-steps by breaking the target sampling factor into smaller factors, as sketched below.
We show that our method achieves superior results to state-of-the-art approaches.
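A minimal sketch of the factor-decomposition idea, assuming a learned 2x stage; `stage` is a hypothetical callable, not BIMS-PU's actual module.

```python
# Decompose a large sampling factor into smaller sub-steps,
# e.g. factorize(16) -> [2, 2, 2, 2].
def factorize(rate, base=2):
    factors = []
    while rate > 1:
        assert rate % base == 0, "rate must be a power of base"
        factors.append(base)
        rate //= base
    return factors


def upsample_points(points, rate, stage):
    # stage(points, f) is a hypothetical learned sub-step that
    # multiplies the point count by f.
    for f in factorize(rate):
        points = stage(points, f)
    return points
```

Each sub-step only has to solve an easier, low-ratio problem, which is the stated motivation for the decomposition.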
arXiv Detail & Related papers (2022-06-25T13:13:37Z)
- Multi-Scale Aligned Distillation for Low-Resolution Detection [68.96325141432078]
This paper focuses on boosting the performance of low-resolution models by distilling knowledge from a high- or multi-resolution model (a minimal sketch follows below).
On several instance-level detection tasks and datasets, the low-resolution models trained via our approach perform competitively with high-resolution models trained via conventional multi-scale training.
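A minimal sketch of cross-resolution feature distillation, assuming bilinear resizing of the teacher maps and an L2 match; the paper's actual alignment scheme may differ.

```python
# Match low-res student features to resized high-res teacher features.
import torch.nn.functional as F


def distill_loss(student_feats, teacher_feats):
    # student_feats: list of (B, C, h_i, w_i) maps from the low-res input
    # teacher_feats: list of (B, C, H_i, W_i) maps from the high-res input
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        t = F.interpolate(t, size=s.shape[-2:], mode="bilinear",
                          align_corners=False)
        loss = loss + F.mse_loss(s, t.detach())  # teacher is frozen
    return loss / len(student_feats)
```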
arXiv Detail & Related papers (2021-09-14T12:53:35Z)
- Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition [59.29808528182607]
Cross-resolution face recognition (CRFR) is important in intelligent surveillance and biometric forensics.
Existing shallow learning-based and deep learning-based methods focus on mapping HR-LR face pairs into a joint feature space (sketched below).
In this study, we aim to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR.
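A minimal sketch of the joint-feature-space idea, assuming paired HR/LR embeddings of the same identity and a cosine-alignment loss; this is generic, not the paper's method.

```python
# Pull paired HR/LR face embeddings together in a shared space.
import torch.nn.functional as F


def pair_alignment_loss(hr_emb, lr_emb):
    # hr_emb, lr_emb: (B, D) embeddings of the same identities
    hr = F.normalize(hr_emb, dim=-1)
    lr = F.normalize(lr_emb, dim=-1)
    return (1.0 - (hr * lr).sum(dim=-1)).mean()  # cosine alignment
```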
arXiv Detail & Related papers (2021-03-25T14:03:42Z)
- Interpretable Detail-Fidelity Attention Network for Single Image Super-Resolution [89.1947690981471]
We propose a purposeful and interpretable detail-fidelity attention network that progressively processes smooth regions and details in a divide-and-conquer manner.
In particular, we propose Hessian filtering for an interpretable feature representation that is well suited to detail inference (a minimal sketch follows below).
Experiments demonstrate that the proposed method achieves superior performance over state-of-the-art methods.
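A minimal sketch of a Hessian response as a detail indicator, assuming fixed central-difference kernels; it illustrates the idea only, not the paper's Hessian filtering module.

```python
# Per-pixel Hessian magnitude from second-derivative kernels; large
# responses mark detail regions, small ones smooth regions.
import torch
import torch.nn.functional as F


def hessian_response(gray):  # gray: (B, 1, H, W)
    k_xx = torch.tensor([[[[1.0, -2.0, 1.0]]]]).to(gray)   # d2/dx2
    k_yy = k_xx.transpose(-1, -2)                          # d2/dy2
    k_xy = 0.25 * torch.tensor([[[[1.0, 0.0, -1.0],
                                  [0.0, 0.0, 0.0],
                                  [-1.0, 0.0, 1.0]]]]).to(gray)  # d2/dxdy
    fxx = F.conv2d(gray, k_xx, padding=(0, 1))
    fyy = F.conv2d(gray, k_yy, padding=(1, 0))
    fxy = F.conv2d(gray, k_xy, padding=1)
    # Frobenius norm of the symmetric 2x2 Hessian at each pixel.
    return torch.sqrt(fxx**2 + 2 * fxy**2 + fyy**2 + 1e-12)
```

Such a response can gate where the network spends capacity on detail inference versus smoothing.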
arXiv Detail & Related papers (2020-09-28T08:31:23Z)