Object Segmentation by Mining Cross-Modal Semantics
- URL: http://arxiv.org/abs/2305.10469v3
- Date: Fri, 4 Aug 2023 19:52:15 GMT
- Title: Object Segmentation by Mining Cross-Modal Semantics
- Authors: Zongwei Wu, Jingjing Wang, Zhuyun Zhou, Zhaochong An, Qiuping Jiang,
Cédric Demonceaux, Guolei Sun, Radu Timofte
- Abstract summary: We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
- Score: 68.88086621181628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-sensor clues have shown promise for object segmentation, but inherent
noise in each sensor, as well as calibration errors in practice, may bias the
segmentation accuracy. In this paper, we propose a novel approach by mining
the Cross-Modal Semantics to guide the fusion and decoding of multimodal
features, with the aim of controlling the modal contribution based on relative
entropy. We explore semantics among the multimodal inputs in two aspects: the
modality-shared consistency and the modality-specific variation. Specifically,
we propose a novel network, termed XMSNet, consisting of (1) all-round
attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer
self-supervision. On the one hand, the AF block explicitly dissociates the
shared and specific representation and learns to weight the modal contribution
by adjusting the proportion, region, and pattern, depending
upon the quality. On the other hand, our CFD initially decodes the shared
feature and then refines the output through specificity-aware querying.
Further, we enforce semantic consistency across the decoding layers to enable
interaction across network hierarchies, improving feature discriminability.
Exhaustive comparisons on eleven datasets with depth or thermal clues, and on
two challenging tasks, namely salient and camouflaged object segmentation,
validate the effectiveness of our approach in terms of both performance and robustness. The
source code is publicly available at https://github.com/Zongwei97/XMSNet.
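The abstract describes weighting each modality's contribution by its relative entropy, so that a noisier sensor contributes less to the fused feature. XMSNet's actual AF block is not reproduced here; the following is only an illustrative sketch of entropy-based modal weighting inferred from the abstract, with all function and variable names being hypothetical.

```python
import numpy as np

def entropy_weighted_fusion(feat_a, feat_b, eps=1e-12):
    """Illustrative sketch (not the paper's implementation): fuse two
    modality feature maps of shape (C, H, W), down-weighting the modality
    whose spatial response is closer to uniform (higher entropy), on the
    assumption that a flatter response indicates a noisier sensor."""
    scores = []
    for feat in (feat_a, feat_b):
        # Collapse channels and form a spatial probability distribution
        # via a numerically stable softmax over all pixels.
        m = feat.mean(axis=0).ravel()
        m = np.exp(m - m.max())
        p = m / m.sum()
        # Shannon entropy of the spatial map; peaked maps score low.
        ent = -np.sum(p * np.log(p + eps))
        scores.append(-ent)  # lower entropy -> higher confidence
    # Softmax the two confidence scores into normalized fusion weights.
    s = np.array(scores)
    w = np.exp(s - s.max())
    w = w / w.sum()
    return w[0] * feat_a + w[1] * feat_b, w

# A sharply peaked (confident) map should dominate near-uniform noise:
peaked = np.zeros((3, 8, 8)); peaked[:, 4, 4] = 10.0
noisy = np.random.default_rng(0).normal(scale=0.01, size=(3, 8, 8))
fused, weights = entropy_weighted_fusion(peaked, noisy)
```

In this toy setup the low-entropy (peaked) modality receives the larger weight, which mirrors the abstract's idea of controlling modal contribution by relative entropy; the real AF block additionally separates shared from specific representations, which this sketch omits.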
Related papers
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection [17.406051477690134]
Event cameras output sparse and asynchronous events, providing a potential solution to these problems.
We propose a novel hierarchical feature refinement network for event-frame fusion.
Our method exhibits significantly better robustness when introducing 15 different corruption types to the frame images.
arXiv Detail & Related papers (2024-07-17T14:09:46Z)
- Generalized Correspondence Matching via Flexible Hierarchical Refinement and Patch Descriptor Distillation [13.802788788420175]
Correspondence matching plays a crucial role in numerous robotics applications.
This paper addresses the limitations of deep feature matching (DFM), a state-of-the-art (SoTA) plug-and-play correspondence matching approach.
Our proposed method achieves an overall performance in terms of mean matching accuracy of 0.68, 0.92, and 0.95 with respect to the tolerances of 1, 3, and 5 pixels, respectively.
arXiv Detail & Related papers (2024-03-08T15:32:18Z)
- DiffVein: A Unified Diffusion Network for Finger Vein Segmentation and Authentication [50.017055360261665]
We introduce DiffVein, a unified diffusion model-based framework which simultaneously addresses vein segmentation and authentication tasks.
For better feature interaction between these two branches, we introduce two specialized modules.
In this way, our framework allows for a dynamic interplay between diffusion and segmentation embeddings.
arXiv Detail & Related papers (2024-02-03T06:49:42Z)
- FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space [7.324708513042455]
This paper proposes a novel contrastive learning framework, called FOCAL, for extracting comprehensive features from multimodal time-series sensing signals.
It consistently outperforms the state-of-the-art baselines in downstream tasks with a clear margin.
arXiv Detail & Related papers (2023-10-30T22:55:29Z)
- PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a VSOD network with up and down parallel symmetry, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z)
- A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition [7.80238628278552]
We propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition.
To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset.
The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76% accuracy with 26.30M parameters.
arXiv Detail & Related papers (2021-11-03T12:24:03Z)
- Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z)
- Decoupled and Memory-Reinforced Networks: Towards Effective Feature Learning for One-Step Person Search [65.51181219410763]
One-step methods have been developed to handle pedestrian detection and identification sub-tasks using a single network.
There are two major challenges in the current one-step approaches.
We propose a decoupled and memory-reinforced network (DMRNet) to overcome these problems.
arXiv Detail & Related papers (2021-02-22T06:19:45Z)
- AlignSeg: Feature-Aligned Segmentation Networks [109.94809725745499]
We propose Feature-Aligned Networks (AlignSeg) to address misalignment issues during the feature aggregation process.
Our network achieves new state-of-the-art mIoU scores of 82.6% and 45.95%, respectively.
arXiv Detail & Related papers (2020-02-24T10:00:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.