More than the Sum of Its Parts: Ensembling Backbone Networks for
Few-Shot Segmentation
- URL: http://arxiv.org/abs/2402.06581v1
- Date: Fri, 9 Feb 2024 18:01:15 GMT
- Title: More than the Sum of Its Parts: Ensembling Backbone Networks for
Few-Shot Segmentation
- Authors: Nico Catalano, Alessandro Maranelli, Agnese Chiatti, Matteo Matteucci
- Abstract summary: We study whether fusing features from different backbones can improve the ability of Few-Shot Segmentation (FSS) models to capture richer visual features.
We propose and compare two ensembling techniques: Independent Voting and Feature Fusion.
Our approach outperforms the original single-backbone PANet across standard benchmarks even in challenging one-shot learning scenarios.
- Score: 49.090592800481616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation is a key prerequisite to robust image understanding for
applications in Artificial Intelligence and Robotics. Few-Shot Segmentation (FSS),
in particular, concerns the extension and optimization of traditional segmentation
methods in challenging conditions where limited training examples are available. A
predominant approach in FSS is to rely on a single backbone for visual feature
extraction. Choosing which backbone to leverage is a deciding factor contributing
to the overall performance. In this work, we investigate whether fusing features
from different backbones can improve the ability of FSS models to capture richer
visual features. To tackle this question, we propose and compare two ensembling
techniques: Independent Voting and Feature Fusion. Among the available FSS
methods, we implement the proposed ensembling techniques on PANet. The module
dedicated to predicting segmentation masks from the backbone embeddings in PANet
avoids trainable parameters, creating a controlled 'in vitro' setting for
isolating the impact of different ensembling strategies. Leveraging the
complementary strengths of different backbones, our approach outperforms the
original single-backbone PANet across standard benchmarks, even in challenging
one-shot learning scenarios. Specifically, it achieves a performance improvement
of +7.37% on PASCAL-5^i and of +10.68% on COCO-20^i in the top-performing
scenario where three backbones are combined. These results, together with the
qualitative inspection of the predicted subject masks, suggest that relying on
multiple backbones in PANet leads to a more comprehensive feature representation,
thus expediting the successful application of FSS methods in challenging,
data-scarce environments.
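Since the abstract describes the two ensembling strategies only at a high level, here is a minimal PyTorch sketch of how they could sit on top of PANet's parameter-free, prototype-based prediction. The masked-average-pooling prototype, the channel-wise concatenation used for Feature Fusion, and the score-averaging rule used for Independent Voting are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (PyTorch) of the two ensembling strategies on top of a
# PANet-style, parameter-free predictor. The combination rules below are
# assumptions for illustration, not the authors' exact implementation.
import torch
import torch.nn.functional as F

def prototype_scores(support_feat, support_mask, query_feat):
    """PANet-style parameter-free prediction for one backbone.

    support_feat: (C, H, W) support-image features
    support_mask: (H, W) binary foreground mask for the support image
    query_feat:   (C, H, W) query-image features
    Returns per-pixel foreground scores of shape (H, W) for the query.
    """
    # Masked average pooling: one foreground prototype per episode.
    mask = support_mask.unsqueeze(0)                       # (1, H, W)
    proto = (support_feat * mask).sum((1, 2)) / mask.sum().clamp(min=1)
    # Cosine similarity between every query location and the prototype.
    q = F.normalize(query_feat, dim=0)                     # (C, H, W)
    p = F.normalize(proto, dim=0)                          # (C,)
    return torch.einsum("chw,c->hw", q, p)

def feature_fusion(support_feats, support_mask, query_feats):
    """Feature Fusion: concatenate backbone embeddings, then predict once."""
    s = torch.cat(support_feats, dim=0)                    # (C1+...+Cn, H, W)
    q = torch.cat(query_feats, dim=0)
    return prototype_scores(s, support_mask, q)

def independent_voting(support_feats, support_mask, query_feats):
    """Independent Voting: one prediction per backbone, votes combined here
    by averaging the similarity scores across backbones."""
    votes = [prototype_scores(s, support_mask, q)
             for s, q in zip(support_feats, query_feats)]
    return torch.stack(votes).mean(dim=0)

# Toy episode with three hypothetical backbones producing aligned 32x32 maps.
feats_s = [torch.randn(c, 32, 32) for c in (256, 512, 768)]
feats_q = [torch.randn(c, 32, 32) for c in (256, 512, 768)]
mask_s = (torch.rand(32, 32) > 0.5).float()
# Toy 0.5 threshold; PANet itself assigns each pixel to the nearest of the
# foreground and background prototypes rather than thresholding.
pred_fusion = feature_fusion(feats_s, mask_s, feats_q) > 0.5
pred_voting = independent_voting(feats_s, mask_s, feats_q) > 0.5
```

Note that Feature Fusion assumes the backbone feature maps share (or are resized to) a common spatial resolution before channel-wise concatenation, whereas Independent Voting only requires that each backbone's support and query features be comparable with one another.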
Related papers
- Face Forgery Detection with Elaborate Backbone [50.914676786151574]
Face Forgery Detection aims to determine whether a digital face is real or fake.
Previous FFD models directly employ existing backbones to represent and extract forgery cues.
We propose leveraging the ViT network with self-supervised learning on real-face datasets to pre-train a backbone.
We then build a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues.
arXiv Detail & Related papers (2024-09-25T13:57:16Z)
- A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives.
We also develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
Our proposed ReSFU framework consistently achieves satisfactory performance on different segmentation applications.
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
- Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
This paper explores the differences across various CLIP-trained vision backbones.
The proposed adaptive ensembling achieves an increase in accuracy of up to 39.1% over the best single backbone (a minimal weighting sketch appears after this list).
arXiv Detail & Related papers (2024-05-27T12:59:35Z)
- Multi-Content Interaction Network for Few-Shot Segmentation [37.80624074068096]
Few-shot segmentation on COCO is challenging due to the limited support images and large intra-class appearance discrepancies.
We propose a Multi-Content Interaction Network (MCINet) to remedy this issue.
MCINet improves FSS by incorporating the low-level structural information from another query branch into the high-level semantic features.
arXiv Detail & Related papers (2023-03-11T04:21:59Z)
- GaitStrip: Gait Recognition via Effective Strip-based Feature Representations and Multi-Level Framework [34.397404430838286]
We present a strip-based multi-level gait recognition network, named GaitStrip, to extract comprehensive gait information at different levels.
To be specific, our high-level branch explores the context of gait sequences and our low-level one focuses on detailed posture changes.
Our GaitStrip achieves state-of-the-art performance in both normal walking and complex conditions.
arXiv Detail & Related papers (2022-03-08T09:49:48Z)
- Retrieve-and-Fill for Scenario-based Task-Oriented Semantic Parsing [110.4684789199555]
We introduce scenario-based semantic parsing: a variant of the original task which first requires disambiguating an utterance's "scenario".
This formulation enables us to isolate coarse-grained and fine-grained aspects of the task, each of which we solve with off-the-shelf neural modules.
Our model is modular, differentiable, interpretable, and allows us to garner extra supervision from scenarios.
arXiv Detail & Related papers (2022-02-02T08:00:21Z)
- CFNet: Learning Correlation Functions for One-Stage Panoptic Segmentation [46.252118473248316]
We propose to first predict semantic-level and instance-level correlations among different locations that are utilized to enhance the backbone features.
We then feed the improved discriminative features into the corresponding segmentation heads, respectively.
We achieve state-of-the-art performance on MS-COCO with 45.1% PQ and on ADE20k with 32.6% PQ.
arXiv Detail & Related papers (2022-01-13T05:31:14Z)
- Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine object details along their boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z)
- Unsupervised segmentation via semantic-apparent feature fusion [21.75371777263847]
This research proposes an unsupervised foreground segmentation method based on semantic-apparent feature fusion (SAFF).
Semantic features respond strongly to the key regions of the foreground object, while apparent features provide richer detailed expression.
By fusing semantic and apparent features, as well as cascading the modules of intra-image adaptive feature weight learning and inter-image common feature learning, the research achieves performance that significantly exceeds baselines.
arXiv Detail & Related papers (2020-05-21T08:28:49Z)
- Multi-Person Pose Estimation with Enhanced Feature Aggregation and Selection [33.15192824888279]
We propose a novel Enhanced Feature Aggregation and Selection network (EFASNet) for multi-person 2D human pose estimation.
Our method handles crowded, cluttered, and occluded scenes well.
Comprehensive experiments demonstrate that the proposed approach outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2020-03-20T08:33:25Z)
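The "Synergy and Diversity in CLIP" entry above reports gains of up to 39.1% from adaptively ensembling multiple vision backbones. As a companion to the PANet sketch, the following is a hypothetical illustration of that general idea for classification; the softmax gate, per-backbone projections, and linear head are assumptions for illustration, not that paper's actual method.

```python
# Hypothetical sketch: learn softmax-normalized weights over frozen
# backbone embeddings. Gate, projections, and head are illustrative
# assumptions, not the method of the cited paper.
import torch
import torch.nn as nn

class AdaptiveBackboneEnsemble(nn.Module):
    """Weighted combination of frozen backbone embeddings (illustrative)."""

    def __init__(self, embed_dims, shared_dim, num_classes):
        super().__init__()
        # Project each backbone's embedding to a shared dimension.
        self.projs = nn.ModuleList(nn.Linear(d, shared_dim) for d in embed_dims)
        # One learnable logit per backbone; softmax turns them into weights.
        self.gate = nn.Parameter(torch.zeros(len(embed_dims)))
        self.head = nn.Linear(shared_dim, num_classes)

    def forward(self, embeddings):
        # embeddings: list of (batch, embed_dims[i]) tensors, one per backbone.
        stacked = torch.stack(
            [proj(e) for proj, e in zip(self.projs, embeddings)], dim=0
        )                                                  # (n, B, D)
        weights = torch.softmax(self.gate, dim=0)          # (n,)
        fused = torch.einsum("n,nbd->bd", weights, stacked)
        return self.head(fused)

# Toy usage with three hypothetical backbones of different widths.
model = AdaptiveBackboneEnsemble(embed_dims=(512, 768, 1024),
                                 shared_dim=256, num_classes=10)
feats = [torch.randn(4, d) for d in (512, 768, 1024)]
logits = model(feats)                                      # (4, 10)
```

Unlike the parameter-free combination rules sketched for PANet, this variant introduces trainable parameters, so each backbone's contribution is learned from data rather than fixed in advance.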