NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
- URL: http://arxiv.org/abs/2511.18452v1
- Date: Sun, 23 Nov 2025 13:43:52 GMT
- Title: NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
- Authors: Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord
- Abstract summary: Neighborhood Attention Filtering (NAF) learns adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE). NAF operates zero-shot: it upsamples features from any Vision Foundation Model (VFM) without retraining. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS.
- Score: 80.55691420311616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.
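The core idea of guiding upsampling with "adaptive spatial-and-content weights" from the high-resolution image can be illustrated with a classical joint bilateral filter, a hand-crafted stand-in for what NAF learns via Cross-Scale Neighborhood Attention and RoPE. Everything below (function name, parameters `sigma_s`/`sigma_c`, the Gaussian weighting) is an illustrative sketch, not the paper's actual architecture:

```python
import numpy as np

def neighborhood_upsample(feat_lr, guide_hr, k=3, sigma_s=1.0, sigma_c=0.1):
    """Upsample low-res features (Hl, Wl, C) to the resolution of a
    high-res guide image (Hh, Wh, G) using weights that combine spatial
    distance and guide-content similarity over a k x k neighborhood.
    A fixed-form proxy for NAF's learned attention weights."""
    Hh, Wh = guide_hr.shape[:2]
    Hl, Wl, C = feat_lr.shape
    out = np.zeros((Hh, Wh, C))
    for y in range(Hh):
        for x in range(Wh):
            fy, fx = y * Hl / Hh, x * Wl / Wh   # position in low-res grid
            cy, cx = int(fy), int(fx)
            num, den = np.zeros(C), 0.0
            for dy in range(-(k // 2), k // 2 + 1):
                for dx in range(-(k // 2), k // 2 + 1):
                    ny = int(np.clip(cy + dy, 0, Hl - 1))
                    nx = int(np.clip(cx + dx, 0, Wl - 1))
                    # spatial weight: distance in the low-res grid
                    ws = np.exp(-((fy - ny) ** 2 + (fx - nx) ** 2) / (2 * sigma_s ** 2))
                    # content weight: guide pixel vs guide pixel nearest the lr neighbor
                    gy = min(int((ny + 0.5) * Hh / Hl), Hh - 1)
                    gx = min(int((nx + 0.5) * Wh / Wl), Wh - 1)
                    wc = np.exp(-np.sum((guide_hr[y, x] - guide_hr[gy, gx]) ** 2)
                                / (2 * sigma_c ** 2))
                    w = ws * wc
                    num += w * feat_lr[ny, nx]
                    den += w
            out[y, x] = num / den
    return out
```

Because the content weight suppresses neighbors whose guide pixels differ, upsampled features follow image edges rather than bleeding across them, which is the behavior NAF's learned attention generalizes.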
Related papers
- PUFM++: Point Cloud Upsampling via Enhanced Flow Matching [15.738247394527024]
PUFM++ is an enhanced flow-matching framework for reconstructing point clouds from sparse, noisy, and partial observations. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to better approximate the terminal marginal distribution. Experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling.
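The "direct, straight-path flow" in the first stage corresponds to the standard rectified flow-matching target: points on the linear interpolation between source and target have constant velocity. A minimal sketch of that regression target (the function name is illustrative, not from the paper):

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Straight-path flow matching: sample a point x_t on the line from
    x0 (e.g. sparse input) to x1 (dense target) at time t in [0, 1];
    the velocity the model regresses to is the constant v = x1 - x0."""
    xt = (1 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v
```

A network trained to predict `v` from `(xt, t)` can then transport samples from the sparse input toward the dense target by integrating the learned velocity field.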
arXiv Detail & Related papers (2025-12-24T06:30:42Z) - MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization [6.027431240137503]
Cross-view geo-localization aims to determine the geographical location of a query image by matching it against a gallery of images. This task is challenging due to significant appearance variations of objects observed from different views, along with the difficulty of extracting discriminative features. Existing approaches often rely on feature-map segmentation while neglecting spatial and semantic information.
arXiv Detail & Related papers (2025-09-16T04:51:52Z) - Fourier-Guided Attention Upsampling for Image Super-Resolution [0.13999481573773068]
Frequency-Guided Attention (FGA) is a lightweight upsampling module for single-image super-resolution. Experiments show average PSNR gains of 0.12 to 0.14 dB and improved frequency-domain consistency by up to 29%.
arXiv Detail & Related papers (2025-08-14T13:13:17Z) - Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition [83.40450475728792]
We present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency, spatially localized textures and low-frequency, scale-robust color distortions. Experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size.
arXiv Detail & Related papers (2025-05-25T12:23:10Z) - Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation [24.531539125814877]
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. One way to tackle this limitation is to employ a task-agnostic feature upsampling module that refines the resolution of VFM features. Our benchmarking experiments show that selecting an appropriate upsampling strategy significantly improves VFM feature quality.
arXiv Detail & Related papers (2025-05-04T11:59:26Z) - LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models [27.379438040350188]
Feature upsampling offers a promising direction to address this challenge. We introduce a coordinate-based cross-attention transformer that integrates high-resolution images with coordinates and low-resolution VFM features. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions.
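The coordinate-based cross-attention idea can be sketched in a few lines: each high-resolution pixel forms a query from its normalized coordinates plus RGB value and attends over low-resolution features (keyed by their coordinates and values). The random projections, dimension `d`, and function name below are illustrative placeholders; LoftUp learns these projections:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def coord_cross_attn_upsample(rgb_hr, feat_lr, d=16):
    """Upsample feat_lr (Hl, Wl, C) to the resolution of rgb_hr (Hh, Wh, 3)
    by cross-attending hr (coord, RGB) queries to lr (coord, feature)
    keys. Projections are random here, standing in for learned weights."""
    Hh, Wh, _ = rgb_hr.shape
    Hl, Wl, C = feat_lr.shape
    # normalized coordinate grids in [0, 1]
    yy, xx = np.meshgrid(np.linspace(0, 1, Hh), np.linspace(0, 1, Wh), indexing="ij")
    q_in = np.concatenate([yy.reshape(-1, 1), xx.reshape(-1, 1),
                           rgb_hr.reshape(Hh * Wh, -1)], axis=1)
    ly, lx = np.meshgrid(np.linspace(0, 1, Hl), np.linspace(0, 1, Wl), indexing="ij")
    k_in = np.concatenate([ly.reshape(-1, 1), lx.reshape(-1, 1),
                           feat_lr.reshape(Hl * Wl, C)], axis=1)
    Wq = rng.normal(size=(q_in.shape[1], d))
    Wk = rng.normal(size=(k_in.shape[1], d))
    attn = softmax((q_in @ Wq) @ (k_in @ Wk).T / np.sqrt(d))
    return (attn @ feat_lr.reshape(Hl * Wl, C)).reshape(Hh, Wh, C)
```

Because attention rows sum to one, each output feature is a convex combination of low-resolution features, so values stay within the input's range while the output resolution follows the query grid, which is what lets such upsamplers handle arbitrary target resolutions.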
arXiv Detail & Related papers (2025-04-18T18:46:08Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) module and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent-space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event framework that only requires lightweight decoder adaptation for target datasets.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - Misalignment-Robust Frequency Distribution Loss for Image Transformation [51.0462138717502]
This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution.
We introduce a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain.
Our method is empirically shown to be an effective training constraint, owing to its use of global information in the frequency domain.
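Why a frequency-domain distribution distance tolerates misalignment can be shown with a toy version: comparing sorted FFT magnitude spectra discards both phase and element ordering, so a circularly shifted image incurs zero loss. This is a simplified illustration (the actual FDL uses a sliced Wasserstein distance on projected frequency features, not this exact form):

```python
import numpy as np

def frequency_distribution_loss(pred, target):
    """Toy misalignment-robust loss: 1-D Wasserstein distance between the
    distributions of FFT magnitudes, computed via sorted values. FFT
    magnitude is invariant to circular shifts, so spatial misalignment
    between pred and target is not penalized."""
    Fp = np.abs(np.fft.fft2(pred))
    Ft = np.abs(np.fft.fft2(target))
    return np.mean(np.abs(np.sort(Fp.ravel()) - np.sort(Ft.ravel())))
```

A pixel-wise L1/L2 loss would heavily penalize a shifted but otherwise perfect prediction; this distributional frequency loss does not, which is the property that makes such losses useful for image transformation tasks with imperfectly aligned training pairs.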
arXiv Detail & Related papers (2024-02-28T09:27:41Z) - FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z) - iffDetector: Inference-aware Feature Filtering for Object Detection [70.8678270164057]
We introduce a generic Inference-aware Feature Filtering (IFF) module that can easily be combined with modern detectors.
IFF performs closed-loop optimization by leveraging high-level semantics to enhance the convolutional features.
IFF can be fused with CNN-based object detectors in a plug-and-play manner with negligible computational overhead.
arXiv Detail & Related papers (2020-06-23T02:57:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.