Related papers: FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation

FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation

URL: http://arxiv.org/abs/2506.23323v3
Date: Tue, 15 Jul 2025 07:19:26 GMT
Title: FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation
Authors: Quang-Huy Che, Vinh-Tiep Nguyen,
Abstract summary: Open-vocabulary semantic segmentation aims to segment objects from arbitrary text categories without requiring densely annotated datasets.<n>We present FA-Seg, a training-free framework for open-vocabulary segmentation based on diffusion models.
Score: 1.4525238046020867
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code will be open-sourced after this paper is accepted.

Related papers

No time to train! Training-Free Reference-Based Instance Segmentation [15.061599989448867]
This work investigates the task of object segmentation when provided with only a small set of reference images.<n>Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image.<n>We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method.
arXiv Detail & Related papers (2025-07-03T16:59:01Z)
Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation [37.61604558855609]
Universal Domain Adaptation (UniDA) focuses on transferring source domain knowledge to the target domain under both domain shift and unknown category shift.<n>Current methods typically obtain target domain semantics centers from an unconstrained continuous image representation space.<n>In this paper, based on vision-language models, we search for semantic centers in a semantically meaningful and discrete text representation space.
arXiv Detail & Related papers (2025-06-04T03:11:53Z)
Federated Unsupervised Semantic Segmentation [14.64737842208937]
This work explores the application of Federated Learning (FL) in Unsupervised Semantic image (USS)<n>FUSS is the first framework to enable fully decentralized, label-free semantic segmentation training.<n>Experiments on both benchmark and real-world datasets, including binary and multi-class segmentation tasks, show that FUSS consistently outperforms local-only client trainings.
arXiv Detail & Related papers (2025-05-29T09:43:55Z)
Improving Open-Set Semantic Segmentation in 3D Point Clouds by Conditional Channel Capacity Maximization: Preliminary Results [1.1328543389752008]
We propose a plug and play framework for Open-Set Semantic (O3S)<n>By modeling the segmentation pipeline as a conditional Markov chain, we derive a novel regularizer term dubbed Conditional Channel Capacity Maximization (3CM)<n>We show that 3CM encourages the encoder to retain richer, label-dependent features, thereby enhancing the network's ability to distinguish and segment previously unseen categories.
arXiv Detail & Related papers (2025-05-09T04:12:26Z)
One-shot In-context Part Segmentation [97.77292483684877]
We present the One-shot In-context Part (OIParts) framework to tackle the challenges of part segmentation.<n>Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient.<n>We have achieved remarkable segmentation performance across diverse object categories.
arXiv Detail & Related papers (2025-03-03T03:50:54Z)
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties. We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
iSeg: An Iterative Refinement-based Framework for Training-free Segmentation [85.58324416386375]
We present a deep experimental analysis on iteratively refining cross-attention map with self-attention map. We propose an effective iterative refinement framework for training-free segmentation, named iSeg. Our proposed iSeg achieves an absolute gain of 3.8% in terms of mIoU compared to the best existing training-free approach in literature.
arXiv Detail & Related papers (2024-09-05T03:07:26Z)
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named textbfSemantic textbfEquitable textbfClustering (SEC) SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segments an image from a language expression. We develop an algorithm that shifts from being localization-centric to segmentation-language. Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
Masked Supervised Learning for Semantic Segmentation [5.177947445379688]
Masked Supervised Learning (MaskSup) is an effective single-stage learning paradigm that models both short- and long-range context. We show that the proposed method is computationally efficient, yielding an improved performance by 10% on the mean intersection-over-union (mIoU)
arXiv Detail & Related papers (2022-10-03T13:30:19Z)
Beyond the Prototype: Divide-and-conquer Proxies for Few-shot Segmentation [63.910211095033596]
Few-shot segmentation aims to segment unseen-class objects given only a handful of densely labeled samples. We propose a simple yet versatile framework in the spirit of divide-and-conquer. Our proposed approach, named divide-and-conquer proxies (DCP), allows for the development of appropriate and reliable information.
arXiv Detail & Related papers (2022-04-21T06:21:14Z)
Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains. We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
Generalized Few-shot Semantic Segmentation [68.69434831359669]
We introduce a new benchmark called Generalized Few-Shot Semantic (GFS-Seg) to analyze the ability of simultaneously segmenting the novel categories. It is the first study showing that previous representative state-of-the-art generalizations fall short in GFS-Seg. We propose the Context-Aware Prototype Learning (CAPL) that significantly improves performance by 1) leveraging the co-occurrence prior knowledge from support samples, and 2) dynamically enriching contextual information to the conditioned, on the content of each query image.
arXiv Detail & Related papers (2020-10-11T10:13:21Z)
Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results. Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples. Theses frameworks still face the challenge of generalization ability reduction on unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.