REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
- URL: http://arxiv.org/abs/2505.18153v1
- Date: Fri, 23 May 2025 17:59:33 GMT
- Title: REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
- Authors: Savya Khosla, Sethuraman TV, Barnett Lee, Alexander Schwing, Derek Hoiem
- Abstract summary: The Region Encoder Network (REN) is a fast and effective model for generating region-based image representations using point prompts. REN bypasses the costly segmentation step of prior region-encoding pipelines with a lightweight module that directly generates region tokens. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens.
- Score: 52.61034140869803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders-DINO, DINOv2, and OpenCLIP-and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. Code and models are available at: https://github.com/savya08/REN.
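The cross-attention step described in the abstract, where point prompts act as queries over patch features, can be sketched in a few lines. This is a minimal single-head illustration with random features and made-up dimensions, not REN's actual architecture or trained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, wq, wk, wv):
    """One cross-attention block: point-prompt queries attend over patch features."""
    q = queries @ wq                  # (P, d) projected prompt queries
    k = keys_values @ wk              # (N, d) projected patch keys
    v = keys_values @ wv              # (N, d) projected patch values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (P, N) weights
    return attn @ v                   # (P, d): one region token per prompt

rng = np.random.default_rng(0)
d = 64                                       # hypothetical feature dimension
patch_feats = rng.standard_normal((196, d))  # e.g. a 14x14 patch grid from DINO
point_prompts = rng.standard_normal((3, d))  # 3 embedded point prompts (stand-ins)
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

region_tokens = cross_attention(point_prompts, patch_feats, wq, wk, wv)
print(region_tokens.shape)  # (3, 64)
```

Each prompt thus pools the patch features most relevant to its object into a single compact token, which is why no explicit segmentation mask is needed.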
Related papers
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR). For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
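As a loose illustration of score-based token pruning (a generic top-k keep, not METEOR's actual rank-guided assignment strategy), redundant visual tokens can be dropped by ranking per-token importance scores:

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of visual tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[::-1][:k]   # indices of the top-k scores
    return tokens[np.sort(keep)]          # restore original token order

rng = np.random.default_rng(1)
tokens = rng.standard_normal((100, 32))   # 100 visual tokens (hypothetical)
scores = rng.random(100)                  # stand-in importance scores
pruned = prune_tokens(tokens, scores, keep_ratio=0.25)
print(pruned.shape)  # (25, 32)
```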
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
- SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation [91.91385816767057]
Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories.
We propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation.
Our SED method achieves an mIoU of 31.6% on ADE20K with 150 categories at 82 milliseconds per image on a single A6000.
arXiv Detail & Related papers (2023-11-27T05:00:38Z)
- Asymmetric Learned Image Compression with Multi-Scale Residual Block, Importance Map, and Post-Quantization Filtering [15.056672221375104]
Deep learning-based image compression has achieved better rate-distortion (R-D) performance than the latest traditional method, H.266/VVC.
Many leading learned schemes cannot maintain a good trade-off between performance and complexity.
We propose an efficient and effective image coding framework, which achieves similar R-D performance with lower complexity than the state of the art.
arXiv Detail & Related papers (2022-06-21T09:34:29Z)
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed the dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
- HyperSeg: Patch-wise Hypernetwork for Real-time Semantic Segmentation [95.47168925127089]
We present a novel, real-time, semantic segmentation network in which the encoder both encodes and generates the parameters (weights) of the decoder.
We design a new type of hypernetwork, composed of a nested U-Net for drawing higher level context features.
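The hypernetwork idea, where the encoder predicts the weights of the decoder, can be illustrated minimally. The dimensions and the single linear weight generator below are hypothetical and far simpler than HyperSeg's nested U-Net design:

```python
import numpy as np

rng = np.random.default_rng(2)
c_dim, in_dim, out_dim = 16, 8, 4   # hypothetical context / feature / class dims

# The "hypernetwork": a fixed map from a context vector to decoder weights.
w_gen = rng.standard_normal((c_dim, in_dim * out_dim)) / np.sqrt(c_dim)

def decode(context, feats):
    """Generate per-image decoder weights from context, then apply them to feats."""
    w_dec = (context @ w_gen).reshape(in_dim, out_dim)  # predicted decoder weights
    return feats @ w_dec                                # (N, out_dim) logits

context = rng.standard_normal(c_dim)       # e.g. a pooled encoder feature (stand-in)
feats = rng.standard_normal((10, in_dim))  # 10 patch features (stand-ins)
out = decode(context, feats)
print(out.shape)  # (10, 4)
```

Because the decoder weights are regenerated per input, the decoder itself can stay very small, which is what makes such designs attractive for real-time segmentation.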
arXiv Detail & Related papers (2020-12-21T18:58:18Z)
- EfficientFCN: Holistically-guided Decoding for Semantic Segmentation [49.27021844132522]
State-of-the-art semantic segmentation algorithms are mostly based on dilated Fully Convolutional Networks (dilatedFCN).
We propose the EfficientFCN, whose backbone is a common ImageNet pre-trained network without any dilated convolution.
Such a framework achieves comparable or even better performance than state-of-the-art methods with only 1/3 of the computational cost.
arXiv Detail & Related papers (2020-08-24T14:48:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.