Semantic-aligned Fusion Transformer for One-shot Object Detection
- URL: http://arxiv.org/abs/2203.09093v2
- Date: Sun, 20 Mar 2022 09:27:23 GMT
- Title: Semantic-aligned Fusion Transformer for One-shot Object Detection
- Authors: Yizhou Zhao, Xun Guo, Yan Lu
- Abstract summary: One-shot object detection aims at detecting novel objects according to merely one given instance.
Current approaches explore various feature fusions to obtain directly transferable meta-knowledge.
We propose a simple but effective architecture named Semantic-aligned Fusion Transformer (SaFT) to resolve these issues.
- Score: 18.58772037047498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One-shot object detection aims at detecting novel objects according to merely
one given instance. With extreme data scarcity, current approaches explore
various feature fusions to obtain directly transferable meta-knowledge. Yet,
their performances are often unsatisfactory. In this paper, we attribute this
to inappropriate correlation methods that misalign query-support semantics by
overlooking spatial structures and scale variances. Upon analysis, we leverage
the attention mechanism and propose a simple but effective architecture named
Semantic-aligned Fusion Transformer (SaFT) to resolve these issues.
Specifically, we equip SaFT with a vertical fusion module (VFM) for cross-scale
semantic enhancement and a horizontal fusion module (HFM) for cross-sample
feature fusion. Together, they broaden the vision for each feature point from
the support to a whole augmented feature pyramid from the query, facilitating
semantic-aligned associations. Extensive experiments on multiple benchmarks
demonstrate the superiority of our framework. Without fine-tuning on novel
classes, it brings significant performance gains to one-stage baselines,
lifting state-of-the-art results to a higher level.
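The two fusion directions described in the abstract are easy to picture in code. Below is a minimal PyTorch sketch of attention-based cross-scale (vertical) and cross-sample (horizontal) fusion; module and parameter names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two fusion directions; names are assumptions.
import torch
import torch.nn as nn


class VerticalFusion(nn.Module):
    """Cross-scale enhancement: each pyramid level attends to the
    tokens of all levels (one shared attention layer here)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pyramid):  # list of (B, N_l, C) token tensors
        all_tokens = torch.cat(pyramid, dim=1)        # (B, sum N_l, C)
        out = []
        for level in pyramid:
            fused, _ = self.attn(level, all_tokens, all_tokens)
            out.append(level + fused)                 # residual enhancement
        return out


class HorizontalFusion(nn.Module):
    """Cross-sample fusion: every query token attends to the support
    instance's tokens, aligning query and support semantics."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_tokens, support_tokens):
        fused, _ = self.attn(query_tokens, support_tokens, support_tokens)
        return query_tokens + fused


# Toy usage: a 3-level query pyramid and one support instance.
vfm, hfm = VerticalFusion(), HorizontalFusion()
pyramid = [torch.randn(2, n, 256) for n in (400, 100, 25)]
support = torch.randn(2, 49, 256)
enhanced = vfm(pyramid)
aligned = [hfm(level, support) for level in enhanced]
```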
Related papers
- A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives.
We also develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
Our proposed ReSFU framework consistently achieves satisfactory performance on different segmentation applications.
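To make the idea concrete, here is a toy PyTorch sketch of similarity-based upsampling, where each high-resolution position gathers low-resolution features by query-key similarity. It uses global attention and omits ReSFU's fine-grained neighbor selection; all names are illustrative, not the paper's code.

```python
import torch

def similarity_upsample(lr_feat: torch.Tensor, hr_guide: torch.Tensor) -> torch.Tensor:
    """lr_feat: (B, C, h, w) low-res features; hr_guide: (B, C, H, W) guidance."""
    B, C, h, w = lr_feat.shape
    H, W = hr_guide.shape[-2:]
    q = hr_guide.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from HR guidance
    k = lr_feat.flatten(2).transpose(1, 2)    # (B, h*w, C) keys/values from LR features
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, H*W, h*w)
    up = attn @ k                             # similarity-weighted sum of LR features
    return up.transpose(1, 2).reshape(B, C, H, W)

up = similarity_upsample(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 64, 64))
print(up.shape)  # torch.Size([1, 64, 64, 64])
```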
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
- Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing information from different modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods in mAP, with gains of 5.9% on the M3FD dataset and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z)
- Dual-modal Prior Semantic Guided Infrared and Visible Image Fusion for Intelligent Transportation System [22.331591533400402]
Infrared and visible image fusion (IVF) plays an important role in intelligent transportation systems (ITS).
We propose a novel prior semantic guided image fusion method based on the dual-modality strategy.
arXiv Detail & Related papers (2024-03-24T16:41:50Z)
- Fine-Grained Prototypes Distillation for Few-Shot Object Detection [8.795211323408513]
Few-shot object detection (FSOD) aims at extending a generic detector for novel object detection with only a few training examples.
In general, methods based on meta-learning employ an additional support branch to encode novel examples into class prototypes.
New methods are required to capture the distinctive local context for more robust novel object detection.
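For reference, the prototype step these meta-learning support branches share can be sketched in a few lines of PyTorch. Mean pooling is the simplest variant and stands in for the paper's fine-grained distillation, which it does not reproduce.

```python
import torch

def class_prototypes(support_feats: torch.Tensor,
                     labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Average the support features of each class into one prototype.
    support_feats: (N, C) pooled per-instance features; labels: (N,)."""
    protos = torch.zeros(num_classes, support_feats.size(1))
    for c in range(num_classes):
        protos[c] = support_feats[labels == c].mean(dim=0)
    return protos

feats = torch.randn(6, 128)                      # 3 classes x 2 shots
labels = torch.tensor([0, 0, 1, 1, 2, 2])
print(class_prototypes(feats, labels, 3).shape)  # torch.Size([3, 128])
```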
arXiv Detail & Related papers (2024-01-15T12:12:48Z)
- ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection [25.66305300362193]
A novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interaction.
This framework enhances the discriminability of object features through the query-guided cross-attention mechanism.
The proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios.
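A hedged sketch of what dual cross-attention between two modalities can look like in PyTorch follows; it illustrates the query-guided idea, not the ICAFusion implementation, and all names are assumptions.

```python
# Illustrative dual cross-attention between RGB and thermal features.
import torch
import torch.nn as nn


class DualCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.rgb_to_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ir_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, ir):  # both (B, N, C) token sequences
        # Each modality queries the other, so complementary cues flow
        # in both directions before the features are merged.
        rgb_fused, _ = self.rgb_to_ir(rgb, ir, ir)
        ir_fused, _ = self.ir_to_rgb(ir, rgb, rgb)
        return (rgb + rgb_fused) + (ir + ir_fused)  # simple additive merge


fuse = DualCrossAttention()
merged = fuse(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```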
arXiv Detail & Related papers (2023-08-15T00:02:10Z)
- Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation (SegMiF).
arXiv Detail & Related papers (2023-08-04T01:03:58Z)
- MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates.
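Decision-level rectification of this kind amounts to reweighting a 3D detection score by the semantic score of the image region it projects to. The blend below is a toy assumption for illustration, not the exact FCR formula.

```python
import torch

def rectify_confidence(det_scores: torch.Tensor,
                       img_sem_scores: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Geometric-mean-style blend of 3D detection and 2D semantic scores."""
    return det_scores.pow(1 - alpha) * img_sem_scores.pow(alpha)

# A candidate with weak image support is demoted more than one with strong support.
print(rectify_confidence(torch.tensor([0.9, 0.6]), torch.tensor([0.8, 0.2])))
```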
arXiv Detail & Related papers (2023-07-18T11:26:02Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors [15.042741192427334]
This paper proposes a fusion model named ScaleVLAD to gather multi-scale representations from text, video, and audio.
Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.
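For orientation, a minimal VLAD aggregation (the descriptor pooling underlying each scale) can be sketched as follows; it uses hard assignment for brevity, and the names are illustrative rather than the ScaleVLAD code.

```python
import torch
import torch.nn.functional as F

def vlad(descriptors: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """descriptors: (N, D) local features; centers: (K, D) cluster centers.
    Returns a (K*D,) vector of aggregated residuals."""
    # Hard-assign each descriptor to its nearest cluster center.
    assign = torch.cdist(descriptors, centers).argmin(dim=1)      # (N,)
    residuals = torch.zeros_like(centers)
    for k in range(centers.size(0)):
        members = descriptors[assign == k]
        if members.numel() > 0:
            residuals[k] = (members - centers[k]).sum(dim=0)
    # Intra-normalize per cluster, then L2-normalize the whole vector.
    residuals = F.normalize(residuals, dim=1)
    return F.normalize(residuals.flatten(), dim=0)

v = vlad(torch.randn(50, 32), torch.randn(8, 32))
print(v.shape)  # torch.Size([256])
```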
arXiv Detail & Related papers (2021-12-02T16:09:33Z)
- Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning [96.75889543560497]
In many real-world problems, collecting a large number of labeled samples is infeasible.
Few-shot learning is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in the presence of a limited number of samples.
We propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations.
arXiv Detail & Related papers (2021-03-01T21:14:33Z)
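The joint objective can be illustrated with a small PyTorch sketch using image rotations as the transformation set; the encoder, heads, and losses here are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
rot_head = nn.Linear(128, 4)  # equivariant head: predict which rotation was applied

def inv_equi_losses(x: torch.Tensor):
    views = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    feats = [encoder(v) for v in views]
    # Equivariance: features must retain enough information to
    # identify the applied transformation.
    logits = torch.cat([rot_head(f) for f in feats])
    labels = torch.arange(4).repeat_interleave(x.size(0))
    l_equi = F.cross_entropy(logits, labels)
    # Invariance: normalized embeddings of all views should agree.
    z = [F.normalize(f, dim=1) for f in feats]
    l_inv = sum(1 - F.cosine_similarity(z[0], zi).mean() for zi in z[1:])
    return l_equi, l_inv

l_e, l_i = inv_equi_losses(torch.randn(8, 3, 32, 32))
```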
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.