PAANet: Visual Perception based Four-stage Framework for Salient Object Detection using High-order Contrast Operator
- URL: http://arxiv.org/abs/2211.08724v1
- Date: Wed, 16 Nov 2022 07:28:07 GMT
- Title: PAANet: Visual Perception based Four-stage Framework for Salient Object Detection using High-order Contrast Operator
- Authors: Yanbo Yuan, Hua Zhong, Haixiong Li, Xiao cheng, Linmei Xia
- Abstract summary: We propose a four-stage framework for salient object detection (SOD). The first two stages match the Pre-Attentive process, consisting of general feature extraction (GFE) and feature preprocessing (FP). The last two stages correspond to the Attention process, containing saliency feature extraction (SFE) and feature aggregation (FA).
- Score: 5.147934362641464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The human visual system (HVS) is believed to perform salient object
detection (SOD) through a pre-attentive process followed by an attention
process. Based on this, we propose a four-stage framework for SOD in which the
first two stages match the Pre-Attentive process, consisting of general
feature extraction (GFE) and feature preprocessing (FP), and the last two
stages correspond to the Attention process, containing saliency feature
extraction (SFE) and feature aggregation (FA); hence the name PAANet.
Mirroring the pre-attentive process, the GFE stage applies a fully pre-trained
backbone and needs no further finetuning for different datasets, which greatly
increases training speed. The FP stage plays the finetuning role but works
more efficiently thanks to its simpler structure and fewer parameters.
Moreover, in the SFE stage we design a novel contrast operator for saliency
feature extraction, which works more semantically than the traditional
convolution operator when extracting the interactive information between the
foreground and its surroundings. Interestingly, this contrast operator can be
cascaded to form a deeper structure that extracts higher-order saliency more
effectively in complex scenes. Comparative experiments against
state-of-the-art methods on five datasets demonstrate the effectiveness of our
framework.
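The GFE/FP split above can be made concrete with a short sketch. This is a minimal illustration, assuming a torchvision ResNet-50 as a stand-in backbone and a hypothetical adapter layout for the FP stage; the paper does not specify these details here.

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# GFE stage: a fully pre-trained backbone, frozen so it needs no further
# finetuning for different datasets (ResNet-50 is an assumed stand-in).
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad_(False)

# FP stage: a small trainable adapter that plays the finetuning role with a
# simpler structure and far fewer parameters (hypothetical layout).
feature_preprocessor = nn.Sequential(
    nn.Conv2d(2048, 256, kernel_size=1),  # compress backbone output channels
    nn.ReLU(inplace=True),
)

# Only the FP (and downstream SFE/FA) parameters would reach the optimizer,
# which is what makes per-dataset training fast.
trainable_params = list(feature_preprocessor.parameters())
```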
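The abstract describes the contrast operator only at a high level. A common realization of "contrast" in the SOD literature is a feature response minus its locally pooled surroundings; the sketch below follows that assumption, including the cascading that yields higher-order contrast. The class names, pool size, and depth are all hypothetical, and PAANet's exact operator may differ.

```python
import torch.nn as nn
import torch.nn.functional as F

class ContrastOperator(nn.Module):
    """Hypothetical contrast operator: responds to the difference between a
    location's features and the average features of its surroundings, rather
    than to raw local appearance as a plain convolution does."""
    def __init__(self, channels: int, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # mix contrast channels

    def forward(self, x):
        contrast = x - self.pool(x)  # foreground vs. surrounding context
        return F.relu(self.proj(contrast))

class CascadedContrast(nn.Module):
    """Stacking the operator gives higher-order contrast, analogous to the
    deeper cascaded structure the abstract describes for complex scenes."""
    def __init__(self, channels: int, depth: int = 2):
        super().__init__()
        self.stages = nn.Sequential(*[ContrastOperator(channels) for _ in range(depth)])

    def forward(self, x):
        return self.stages(x)
```

Applied to the FP-stage features, a single ContrastOperator captures first-order foreground/background interaction, while each additional stage contrasts the contrast map itself.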
Related papers
- VP Lab: a PEFT-Enabled Visual Prompting Laboratory for Semantic Segmentation [18.680875997611025]
VP Lab is a comprehensive iterative framework that enhances visual prompting for robust segmentation model development. E-PEFT is a novel ensemble of parameter-efficient fine-tuning techniques designed to adapt our visual prompting pipeline to specific domains. By integrating E-PEFT with visual prompting, we demonstrate a remarkable 50% increase in semantic segmentation mIoU performance across various technical datasets.
arXiv Detail & Related papers (2025-05-21T14:46:57Z)
- Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation [5.326302374594885]
Foreground segmentation is crucial for scene understanding, yet parameter-efficient fine-tuning (PEFT) of vision foundation models (VFMs) often fails in complex scenarios.
We propose Ladder Shape-bias Representation Side-tuning (LSR-ST), a lightweight PEFT framework that enhances model robustness by introducing shape-biased inductive priors.
arXiv Detail & Related papers (2025-04-20T04:12:38Z)
- Prior2Former -- Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation [74.55677741919035]
We propose Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes.
arXiv Detail & Related papers (2025-04-07T08:53:14Z)
- "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.
Our approach achieves state-of-the-art reconstruction performance and offers better interpretability, aligning with the human visual system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
- PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion [80.79938369319152]
We design a new pipeline, coined PCF-Lift, based on our Probabilistic Contrastive Fusion (PCF).
Our PCF-Lift significantly outperforms state-of-the-art methods on widely used benchmarks, including the ScanNet dataset and the Messy Room dataset (a 4.4% improvement in scene-level PQ).
arXiv Detail & Related papers (2024-10-14T16:06:59Z)
- ViTGaze: Gaze Following with Interaction Features in Vision Transformers [42.08842391756614]
We introduce a novel single-modality gaze following framework called ViTGaze.
In contrast to previous methods, it creates a novel gaze following framework based mainly on powerful encoders.
Our method achieves state-of-the-art (SOTA) performance among all single-modality methods.
arXiv Detail & Related papers (2024-03-19T14:45:17Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial to enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- CONTRASTE: Supervised Contrastive Pre-training With Aspect-based Prompts For Aspect Sentiment Triplet Extraction [13.077459544929598]
We present a novel pre-training strategy using CONTRastive learning to enhance the ASTE performance.
We also demonstrate the advantage of our proposed technique on other ABSA tasks such as ACOS, TASD, and AESC.
arXiv Detail & Related papers (2023-10-24T07:40:09Z)
- Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
- GaitStrip: Gait Recognition via Effective Strip-based Feature Representations and Multi-Level Framework [34.397404430838286]
We present a strip-based multi-level gait recognition network, named GaitStrip, to extract comprehensive gait information at different levels.
To be specific, our high-level branch explores the context of gait sequences and our low-level one focuses on detailed posture changes.
Our GaitStrip achieves state-of-the-art performance in both normal walking and complex conditions.
arXiv Detail & Related papers (2022-03-08T09:49:48Z)
- Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
Segmentation-based methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between an image and its neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)