NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
- URL: http://arxiv.org/abs/2511.18452v1
- Date: Sun, 23 Nov 2025 13:43:52 GMT
- Title: NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
- Authors: Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord
- Abstract summary: Neighborhood Attention Filtering (NAF) learns adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE). NAF operates zero-shot: it upsamples features from any Vision Foundation Model (VFM) without retraining. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS.
- Score: 80.55691420311616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.
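The core idea of guiding upsampling with "adaptive spatial-and-content weights" from the high-resolution image can be illustrated with a classical joint bilateral filter, a hand-crafted stand-in for what NAF learns via Cross-Scale Neighborhood Attention and RoPE. Everything below (function name, parameters `sigma_s`/`sigma_c`, the Gaussian weighting) is an illustrative sketch, not the paper's actual architecture:

```python
import numpy as np

def neighborhood_upsample(feat_lr, guide_hr, k=3, sigma_s=1.0, sigma_c=0.1):
    """Upsample low-res features (Hl, Wl, C) to the resolution of a
    high-res guide image (Hh, Wh, G) using weights that combine spatial
    distance and guide-content similarity over a k x k neighborhood.
    A fixed-form proxy for NAF's learned attention weights."""
    Hh, Wh = guide_hr.shape[:2]
    Hl, Wl, C = feat_lr.shape
    out = np.zeros((Hh, Wh, C))
    for y in range(Hh):
        for x in range(Wh):
            fy, fx = y * Hl / Hh, x * Wl / Wh   # position in low-res grid
            cy, cx = int(fy), int(fx)
            num, den = np.zeros(C), 0.0
            for dy in range(-(k // 2), k // 2 + 1):
                for dx in range(-(k // 2), k // 2 + 1):
                    ny = int(np.clip(cy + dy, 0, Hl - 1))
                    nx = int(np.clip(cx + dx, 0, Wl - 1))
                    # spatial weight: distance in the low-res grid
                    ws = np.exp(-((fy - ny) ** 2 + (fx - nx) ** 2) / (2 * sigma_s ** 2))
                    # content weight: guide pixel vs guide pixel nearest the lr neighbor
                    gy = min(int((ny + 0.5) * Hh / Hl), Hh - 1)
                    gx = min(int((nx + 0.5) * Wh / Wl), Wh - 1)
                    wc = np.exp(-np.sum((guide_hr[y, x] - guide_hr[gy, gx]) ** 2)
                                / (2 * sigma_c ** 2))
                    w = ws * wc
                    num += w * feat_lr[ny, nx]
                    den += w
            out[y, x] = num / den
    return out
```

Because the content weight suppresses neighbors whose guide pixels differ, upsampled features follow image edges rather than bleeding across them, which is the behavior NAF's learned attention generalizes.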
Related papers
- PUFM++: Point Cloud Upsampling via Enhanced Flow Matching [15.738247394527024]
PUFM++ is an enhanced flow-matching framework for reconstructing point clouds from sparse, noisy, and partial observations. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to better approximate the terminal marginal distribution. Experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling.
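The "direct, straight-path flow" in the first stage corresponds to the standard rectified flow-matching target: points on the linear interpolation between source and target have constant velocity. A minimal sketch of that regression target (the function name is illustrative, not from the paper):

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Straight-path flow matching: sample a point x_t on the line from
    x0 (e.g. sparse input) to x1 (dense target) at time t in [0, 1];
    the velocity the model regresses to is the constant v = x1 - x0."""
    xt = (1 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v
```

A network trained to predict `v` from `(xt, t)` can then transport samples from the sparse input toward the dense target by integrating the learned velocity field.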
arXiv Detail & Related papers (2025-12-24T06:30:42Z) - MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization [6.027431240137503]
Cross-view geo-localization aims to determine the geographical location of a query image by matching it against a gallery of images. This task is challenging due to significant appearance variations of objects observed from different views, along with the difficulty of extracting discriminative features. Existing approaches often rely on feature-map segmentation while neglecting spatial and semantic information.
arXiv Detail & Related papers (2025-09-16T04:51:52Z) - Fourier-Guided Attention Upsampling for Image Super-Resolution [0.13999481573773068]
Frequency-Guided Attention (FGA) is a lightweight upsampling module for single-image super-resolution. Experiments show average PSNR gains of 0.12 to 0.14 dB and improved frequency-domain consistency by up to 29%.
arXiv Detail & Related papers (2025-08-14T13:13:17Z) - Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition [83.40450475728792]
We present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency, spatially localized textures and low-frequency, scale-robust color distortions. Experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size.
arXiv Detail & Related papers (2025-05-25T12:23:10Z) - Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation [24.531539125814877]
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. One way to tackle this limitation is to employ a task-agnostic feature upsampling module that refines the resolution of VFM features. Our benchmarking experiments show that selecting an appropriate upsampling strategy significantly improves VFM feature quality.
arXiv Detail & Related papers (2025-05-04T11:59:26Z) - LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models [27.379438040350188]
Feature upsampling offers a promising direction to address this challenge. We introduce a coordinate-based cross-attention transformer that integrates high-resolution images with coordinates and low-resolution VFM features. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions.
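The coordinate-based cross-attention idea can be sketched in a few lines: each high-resolution pixel forms a query from its normalized coordinates plus RGB value and attends over low-resolution features (keyed by their coordinates and values). The random projections, dimension `d`, and function name below are illustrative placeholders; LoftUp learns these projections:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def coord_cross_attn_upsample(rgb_hr, feat_lr, d=16):
    """Upsample feat_lr (Hl, Wl, C) to the resolution of rgb_hr (Hh, Wh, 3)
    by cross-attending hr (coord, RGB) queries to lr (coord, feature)
    keys. Projections are random here, standing in for learned weights."""
    Hh, Wh, _ = rgb_hr.shape
    Hl, Wl, C = feat_lr.shape
    # normalized coordinate grids in [0, 1]
    yy, xx = np.meshgrid(np.linspace(0, 1, Hh), np.linspace(0, 1, Wh), indexing="ij")
    q_in = np.concatenate([yy.reshape(-1, 1), xx.reshape(-1, 1),
                           rgb_hr.reshape(Hh * Wh, -1)], axis=1)
    ly, lx = np.meshgrid(np.linspace(0, 1, Hl), np.linspace(0, 1, Wl), indexing="ij")
    k_in = np.concatenate([ly.reshape(-1, 1), lx.reshape(-1, 1),
                           feat_lr.reshape(Hl * Wl, C)], axis=1)
    Wq = rng.normal(size=(q_in.shape[1], d))
    Wk = rng.normal(size=(k_in.shape[1], d))
    attn = softmax((q_in @ Wq) @ (k_in @ Wk).T / np.sqrt(d))
    return (attn @ feat_lr.reshape(Hl * Wl, C)).reshape(Hh, Wh, C)
```

Because attention rows sum to one, each output feature is a convex combination of low-resolution features, so values stay within the input's range while the output resolution follows the query grid, which is what lets such upsamplers handle arbitrary target resolutions.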
arXiv Detail & Related papers (2025-04-18T18:46:08Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) module and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent-space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event framework that only requires lightweight decoder adaptation for target datasets.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - Misalignment-Robust Frequency Distribution Loss for Image Transformation [51.0462138717502]
This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution.
We introduce a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain.
Our method is empirically shown to be an effective training constraint, owing to its use of global information in the frequency domain.
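Why a frequency-domain distribution distance tolerates misalignment can be shown with a toy version: comparing sorted FFT magnitude spectra discards both phase and element ordering, so a circularly shifted image incurs zero loss. This is a simplified illustration (the actual FDL uses a sliced Wasserstein distance on projected frequency features, not this exact form):

```python
import numpy as np

def frequency_distribution_loss(pred, target):
    """Toy misalignment-robust loss: 1-D Wasserstein distance between the
    distributions of FFT magnitudes, computed via sorted values. FFT
    magnitude is invariant to circular shifts, so spatial misalignment
    between pred and target is not penalized."""
    Fp = np.abs(np.fft.fft2(pred))
    Ft = np.abs(np.fft.fft2(target))
    return np.mean(np.abs(np.sort(Fp.ravel()) - np.sort(Ft.ravel())))
```

A pixel-wise L1/L2 loss would heavily penalize a shifted but otherwise perfect prediction; this distributional frequency loss does not, which is the property that makes such losses useful for image transformation tasks with imperfectly aligned training pairs.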
arXiv Detail & Related papers (2024-02-28T09:27:41Z) - FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z) - iffDetector: Inference-aware Feature Filtering for Object Detection [70.8678270164057]
We introduce a generic Inference-aware Feature Filtering (IFF) module that can easily be combined with modern detectors.
IFF performs closed-loop optimization by leveraging high-level semantics to enhance the convolutional features.
IFF can be fused with CNN-based object detectors in a plug-and-play manner with negligible computational overhead.
arXiv Detail & Related papers (2020-06-23T02:57:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.