Transformer-based Context Condensation for Boosting Feature Pyramids in
Object Detection
- URL: http://arxiv.org/abs/2207.06603v1
- Date: Thu, 14 Jul 2022 01:45:03 GMT
- Title: Transformer-based Context Condensation for Boosting Feature Pyramids in
Object Detection
- Authors: Zhe Chen, Jing Zhang, Yufei Xu, Dacheng Tao
- Abstract summary: Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
- Score: 77.50110439560152
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current object detectors typically have a feature pyramid (FP) module for
multi-level feature fusion (MFF) which aims to mitigate the gap between
features from different levels and form a comprehensive object representation
to achieve better detection performance. However, they usually require heavy
cross-level connections or iterative refinement to obtain better MFF results,
making them complicated in structure and inefficient in computation. To address
these issues, we propose a novel and efficient context modeling mechanism that
can help existing FPs deliver better MFF results while reducing the
computational costs effectively. In particular, we introduce a novel insight
that comprehensive contexts can be decomposed and condensed into two types of
representations for higher efficiency. The two representations include a
locally concentrated representation and a globally summarized representation,
where the former focuses on extracting context cues from nearby areas while the
latter extracts key representations of the whole image scene as global context
cues. By collecting the condensed contexts, we employ a Transformer decoder to
investigate the relations between them and each local feature from the FP and
then refine the MFF results accordingly. As a result, we obtain a simple and
light-weight Transformer-based Context Condensation (TCC) module, which can
boost various FPs and lower their computational costs simultaneously. Extensive
experimental results on the challenging MS COCO dataset show that TCC is
compatible with four representative FPs and consistently improves their detection
accuracy by up to 7.8% in terms of average precision and reduces their
complexities by up to around 20% in terms of GFLOPs, helping them achieve
state-of-the-art performance more efficiently. Code will be released.
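To make the mechanism above concrete, here is a minimal PyTorch-style sketch of how one pyramid level could be refined by cross-attending to condensed local and global contexts. It illustrates the idea described in the abstract rather than the authors' released code; the module name TCCBlock, the pooling-based condensation, and the hyperparameters (num_global_tokens, local_pool) are assumptions.

```python
# Minimal sketch of the context-condensation idea (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCCBlock(nn.Module):
    """Refine one FP level by attending to condensed local/global context tokens."""

    def __init__(self, dim=256, num_heads=8, num_global_tokens=4, local_pool=4):
        super().__init__()
        self.local_pool = local_pool                # window for locally concentrated cues
        self.num_global_tokens = num_global_tokens  # tokens summarizing the whole scene
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)

    def forward(self, feat):                        # feat: (B, C, H, W), one pyramid level
        B, C, H, W = feat.shape
        # Locally concentrated representation: pool nearby areas into a coarse token grid.
        local = F.adaptive_avg_pool2d(feat, (max(H // self.local_pool, 1),
                                             max(W // self.local_pool, 1)))
        local = local.flatten(2).transpose(1, 2)                       # (B, Nl, C)
        # Globally summarized representation: a handful of whole-scene tokens.
        glob = F.adaptive_avg_pool2d(feat, (self.num_global_tokens, 1))
        glob = glob.flatten(2).transpose(1, 2)                         # (B, Ng, C)
        memory = torch.cat([local, glob], dim=1)    # condensed contexts used as keys/values
        # Each local FP feature acts as a query and is refined against the contexts.
        queries = feat.flatten(2).transpose(1, 2)                      # (B, H*W, C)
        refined = self.decoder(tgt=queries, memory=memory)
        return refined.transpose(1, 2).reshape(B, C, H, W)
```

In a detector, one such block would presumably sit on each fused pyramid level; because the memory holds only a few condensed tokens rather than all cross-level features, the cross-attention stays cheap. Note that the default decoder layer above still runs self-attention over all H*W queries, which a real implementation would likely simplify or drop.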
Related papers
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM)
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
- Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR).
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations (sketched loosely below).
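As a rough illustration of the two auxiliary objectives just described, recovering local missing features under supervision from complete features and closing the global gap between complete and incomplete representations, the losses below sketch one plausible form. The function names, the masking scheme, and the distance measures are assumptions, not the MCT-HFR formulation.

```python
# Hedged sketch of the two auxiliary objectives described in the summary above.
import torch.nn.functional as F

def local_recovery_loss(recovered, complete, missing_mask):
    """Penalize reconstruction error only where modality features were missing.
    missing_mask is a float tensor (1 at missing positions), same shape as the features."""
    diff = F.smooth_l1_loss(recovered, complete, reduction="none")
    return (diff * missing_mask).sum() / missing_mask.sum().clamp(min=1.0)

def global_alignment_loss(incomplete_global, complete_global):
    """Pull the global representations of paired incomplete and complete inputs together."""
    return 1.0 - F.cosine_similarity(incomplete_global, complete_global, dim=-1).mean()
```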
arXiv Detail & Related papers (2023-12-26T01:59:23Z)
- Efficient and Effective Deep Multi-view Subspace Clustering [9.6753782215283]
We propose a novel deep framework, termed Efficient and Effective deep Multi-View Subspace Clustering (E²MVSC).
Instead of a parameterized FC layer, we design a Relation-Metric Net that decouples network parameter scale from sample numbers for greater computational efficiency.
E²MVSC yields comparable results to existing methods and achieves state-of-the-art performance on various types of multi-view datasets.
arXiv Detail & Related papers (2023-10-15T03:08:25Z)
- MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs a Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
arXiv Detail & Related papers (2023-08-03T04:17:25Z)
- Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z)
- Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring [104.19414150171472]
Attribute skew pulls current federated learning (FL) frameworks away from consistent optimization directions among the clients.
We propose disentangled federated learning (DFL) to disentangle the domain-specific and cross-invariant attributes into two complementary branches.
Experiments verify that DFL facilitates FL with higher performance, better interpretability, and faster convergence rate, compared with SOTA FL methods.
arXiv Detail & Related papers (2022-06-14T13:12:12Z)
- Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
Segmentation-based methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network (a hedged sketch of such a binarization function is given below).
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
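For reference, differentiable binarization of this kind is commonly realized as a steep sigmoid over the difference between the predicted probability map and a learned threshold map. The snippet below is a generic sketch of that function; the names P and T and the steepness factor k (often around 50) are drawn from common DB-style implementations, not from this summary.

```python
# Generic sketch of a differentiable (approximate) binarization function.
import torch

def approx_binarize(P: torch.Tensor, T: torch.Tensor, k: float = 50.0) -> torch.Tensor:
    """Soft, differentiable stand-in for the hard step (P > T):
    P is the probability map, T the learned threshold map, k the steepness factor."""
    return torch.sigmoid(k * (P - T))
```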
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Dynamic Feature Pyramid Networks for Object Detection [40.24111664691307]
We introduce an inception FPN in which each layer contains convolution filters with different kernel sizes to enlarge the receptive field.
We propose a new dynamic FPN (DyFPN) which consists of multiple branches with different computational costs; a rough sketch of such a gated layer is given below.
Experiments conducted on benchmarks demonstrate that the proposed DyFPN significantly improves performance with the optimal allocation of computation resources.
arXiv Detail & Related papers (2020-12-01T19:03:55Z)
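A rough, hypothetical sketch of an inception-style, gated FPN layer of the kind the DyFPN summary describes is given here. The branch layout, the gate design, and the soft weighting are illustrative assumptions; the actual DyFPN presumably makes harder gating decisions so that skipped branches truly save computation.

```python
# Illustrative sketch of a multi-branch FPN layer with a learned gate (not the DyFPN code).
import torch.nn as nn

class GatedInceptionFPNLayer(nn.Module):
    """Parallel convolutions with different kernel sizes, weighted by a learned gate."""

    def __init__(self, channels=256, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes]
        )
        # The gate predicts one weight per branch from globally pooled features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(kernel_sizes)), nn.Softmax(dim=-1),
        )

    def forward(self, x):                      # x: (B, C, H, W), one pyramid level
        weights = self.gate(x)                 # (B, num_branches)
        out = 0
        for i, branch in enumerate(self.branches):
            out = out + weights[:, i].view(-1, 1, 1, 1) * branch(x)
        return out
```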
This list is automatically generated from the titles and abstracts of the papers on this site.