A Spatial-Temporal Dual-Mode Mixed Flow Network for Panoramic Video
Salient Object Detection
- URL: http://arxiv.org/abs/2310.09016v1
- Date: Fri, 13 Oct 2023 11:25:41 GMT
- Title: A Spatial-Temporal Dual-Mode Mixed Flow Network for Panoramic Video
Salient Object Detection
- Authors: Xiaolei Chen, Pengcheng Zhang, Zelong Du, Ishfaq Ahmad
- Abstract summary: We propose a Spatial-Temporal Dual-Mode Mixed Flow Network (STDMMF-Net) that exploits the spatial flow of panoramic video and the corresponding optical flow for SOD.
Extensive subjective and objective experiments show that the proposed method achieves better detection accuracy than state-of-the-art (SOTA) methods.
The proposed method also performs better overall in terms of inference memory, testing time, complexity, and generalization.
- Score: 5.207048071888257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Salient object detection (SOD) in panoramic video is still at an early
exploration stage. Indirectly applying 2D video SOD methods to panoramic video
leaves many challenges unmet, such as low detection accuracy, high model
complexity, and poor generalization. To overcome these hurdles, we design an
Inter-Layer Attention (ILA) module, an Inter-Layer Weight (ILW) module, and a
Bi-Modal Attention (BMA) module. Based on these modules, we propose a
Spatial-Temporal Dual-Mode Mixed Flow Network (STDMMF-Net) that exploits the
spatial flow of panoramic video and the corresponding optical flow for SOD.
First, the ILA module computes attention between adjacent-level features of
consecutive panoramic frames to extract salient object features from the
spatial flow more accurately. Then, the ILW module quantifies the salient
object information contained in each feature level to improve the efficiency
of fusing the levels in the mixed flow. Finally, the BMA module further
improves the detection accuracy of STDMMF-Net. Extensive subjective and
objective experiments show that the proposed method achieves better detection
accuracy than state-of-the-art (SOTA) methods, along with better overall
performance in inference memory, testing time, complexity, and generalization.
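The abstract specifies the roles of the three modules but not their internals. As a reading aid, here is a minimal PyTorch sketch of how they could look; the tensor shapes, attention form, and gating details are assumptions, not the authors' implementation:

```python
# Minimal PyTorch sketch of the three modules named in the abstract.
# The paper specifies only their roles, so every internal detail here
# (shapes, attention form, gating) is an assumption.
import torch
import torch.nn as nn

class ILA(nn.Module):
    """Inter-Layer Attention: attention between adjacent-level features
    of consecutive panoramic frames in the spatial flow."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_t, feat_prev):
        # feat_t, feat_prev: (B, C, H, W), already resized to a common size.
        B, C, H, W = feat_t.shape
        q = self.q(feat_t).flatten(2).transpose(1, 2)     # (B, HW, C)
        k = self.k(feat_prev).flatten(2)                  # (B, C, HW)
        v = self.v(feat_prev).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k / C ** 0.5, dim=-1)    # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return feat_t + out                               # residual refinement

class ILW(nn.Module):
    """Inter-Layer Weight: scores how much salient-object information each
    level carries and weights the levels before fusion in the mixed flow."""
    def __init__(self, channels, num_levels):
        super().__init__()
        self.score = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, 1, 1))
            for _ in range(num_levels))

    def forward(self, feats):
        # feats: list of (B, C, H, W) level features at a common resolution.
        w = torch.softmax(
            torch.cat([s(f) for s, f in zip(self.score, feats)], dim=1), dim=1)
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))

class BMA(nn.Module):
    """Bi-Modal Attention: the spatial-flow and optical-flow branches gate
    each other before the final saliency prediction."""
    def __init__(self, channels):
        super().__init__()
        self.gate_s = nn.Conv2d(channels, channels, 1)
        self.gate_f = nn.Conv2d(channels, channels, 1)

    def forward(self, spatial_feat, flow_feat):
        s = spatial_feat * torch.sigmoid(self.gate_f(flow_feat))
        f = flow_feat * torch.sigmoid(self.gate_s(spatial_feat))
        return s + f
```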
Related papers
- MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection [9.780498146964097]
We propose an innovative network architecture, MonoMM, for real-time monocular 3D object detection.
MonoMM consists of Focused Multi-Scale Fusion (FMF) and Depth-Aware Feature Enhancement Mamba (DMB) modules.
Our method outperforms previous monocular methods and achieves real-time detection.
arXiv Detail & Related papers (2024-08-01T10:16:58Z)
- DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing [8.530409994516619]
Multispectral oriented object detection faces challenges due to both inter-modal and intra-modal discrepancies.
We propose Disparity-guided Multispectral Mamba (DMM), a framework comprising a Disparity-guided Cross-modal Fusion Mamba (DCFM) module, a Multi-scale Target-aware Attention (MTA) module, and a Target-Prior Aware (TPA) auxiliary task.
arXiv Detail & Related papers (2024-07-11T02:09:59Z)
- Efficient Unsupervised Video Object Segmentation Network Based on Motion Guidance [1.5736899098702974]
This paper proposes a video object segmentation network based on motion guidance.
The model comprises a dual-stream network, motion guidance module, and multi-scale progressive fusion module.
Experimental results demonstrate the superior performance of the proposed method.
arXiv Detail & Related papers (2022-11-10T06:13:23Z)
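For the motion-guidance paper above, the summary names the components without details; a minimal sketch of one plausible dual-stream design with a gating-style motion guidance module follows (all shapes and operations are assumptions, not the authors' code):

```python
# Hedged sketch of a dual-stream (appearance + optical flow) segmentation
# network with a gating-style motion guidance module; all details are
# assumptions based on the one-line summary, not the authors' code.
import torch
import torch.nn as nn

class MotionGuidance(nn.Module):
    """Motion features gate the appearance features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, app_feat, motion_feat):
        return app_feat + app_feat * torch.sigmoid(self.gate(motion_feat))

class DualStreamVOS(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Placeholder single-conv encoders stand in for real backbones.
        self.app_encoder = nn.Conv2d(3, channels, 3, padding=1)
        self.motion_encoder = nn.Conv2d(2, channels, 3, padding=1)  # flow: 2 channels
        self.guide = MotionGuidance(channels)
        self.head = nn.Conv2d(channels, 1, 1)

    def forward(self, frame, flow):
        a = self.app_encoder(frame)          # (B, C, H, W)
        m = self.motion_encoder(flow)        # (B, C, H, W)
        return self.head(self.guide(a, m))   # per-pixel object logits
```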
- PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a video salient object detection (VSOD) network with up-down parallel symmetry, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works learn frame-level features of each whole frame and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
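For the temporal sentence grounding paper above, a minimal sketch of how three object-level feature streams could be fused and matched with a sentence query; the encoders, fusion, and head here are assumptions, not MA3SRN's actual design:

```python
# Hedged sketch of fusing motion-, appearance-, and 3D-aware object-level
# features and matching them with a sentence query; the fusion and head
# are assumptions drawn only from the summary above.
import torch
import torch.nn as nn

class TriStreamGrounding(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.fuse = nn.Linear(3 * d, d)                 # merge the three streams
        self.cross = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Linear(d, 2)                     # start/end logits per clip

    def forward(self, motion, appear, feat3d, query):
        # motion/appear/feat3d: (B, T, d) per-clip features; query: (B, L, d).
        video = self.fuse(torch.cat([motion, appear, feat3d], dim=-1))
        attended, _ = self.cross(video, query, query)   # text-conditioned clips
        return self.head(attended)                      # (B, T, 2)
```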
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
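For FSNet above, a minimal sketch of simultaneous ("full-duplex") bidirectional feature passing between an appearance stream and a motion stream before fusion; the exchange form is an assumption, not FSNet's actual design:

```python
# Hedged sketch of simultaneous bidirectional cross-modal feature passing;
# the exchange form is an assumption based only on the summary above.
import torch
import torch.nn as nn

class DuplexExchange(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_motion = nn.Conv2d(channels, channels, 1)
        self.to_appear = nn.Conv2d(channels, channels, 1)

    def forward(self, appear, motion):
        # Both directions read the *pre-exchange* features, so transmission
        # and receiving happen simultaneously rather than sequentially.
        appear_out = appear + torch.sigmoid(self.to_appear(motion)) * motion
        motion_out = motion + torch.sigmoid(self.to_motion(appear)) * appear
        return appear_out, motion_out
```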
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation in applications such as autonomous driving and robotics.
We investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately; the extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
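For EPMF above, a minimal sketch of a residual-based fusion module for two-stream features; the details are assumptions drawn only from the summary:

```python
# Hedged sketch of a residual-based fusion module for two-stream features
# (e.g., camera and LiDAR projections); details are assumptions.
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, cam_feat, lidar_feat):
        # The fused correction is added back to one stream as a residual.
        return cam_feat + self.mix(torch.cat([cam_feat, lidar_feat], dim=1))
```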
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features because the same 2D convolution kernel is applied to every frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
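For the motion representation paper above, a minimal sketch of the two enhancement modules as described; the exact gating and similarity computations are assumptions:

```python
# Hedged sketch of the CME and SME modules described in the summary above;
# the exact gating and similarity computations are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CME(nn.Module):
    """Channel-wise Motion Enhancement: a gate vector derived from the
    frame-to-frame feature difference re-weights the channels."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, feat_t, feat_next):
        diff = (feat_next - feat_t).mean(dim=(2, 3))      # (B, C) motion summary
        gate = self.fc(diff).unsqueeze(-1).unsqueeze(-1)  # channel-wise gate vector
        return feat_t * gate

class SME(nn.Module):
    """Spatial-wise Motion Enhancement: point-to-point similarity between
    adjacent feature maps highlights regions in motion (low similarity)."""
    def forward(self, feat_t, feat_next):
        sim = F.cosine_similarity(feat_t, feat_next, dim=1)  # (B, H, W)
        motion_map = (1.0 - sim).unsqueeze(1)   # large where features change
        return feat_t * (1.0 + motion_map)
```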
- Multi-View Adaptive Fusion Network for 3D Object Detection [14.506796247331584]
3D object detection based on LiDAR-camera fusion is an emerging research theme for autonomous driving.
We propose a single-stage multi-view fusion framework that takes LiDAR bird's-eye view, LiDAR range view and camera view images as inputs for 3D object detection.
We design an end-to-end learnable network named MVAF-Net to integrate these two components.
arXiv Detail & Related papers (2020-11-02T00:06:01Z)
- RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [126.4462739820643]
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD).
The proposed network mainly addresses two challenging issues: 1) how to effectively integrate the complementary information from an RGB image and its corresponding depth map, and 2) how to adaptively select the more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z)
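For the RGB-D SOD paper above, a minimal sketch of cross-modality modulation followed by saliency-related feature selection; both forms are assumptions drawn only from the summary:

```python
# Hedged sketch of cross-modality modulation plus feature selection for
# RGB-D SOD; both forms are assumptions based only on the summary above.
import torch
import torch.nn as nn

class CrossModalityModulation(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.depth_gate = nn.Conv2d(channels, channels, 1)  # depth modulates RGB
        self.select = nn.Sequential(                        # channel selection
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat):
        modulated = rgb_feat + rgb_feat * torch.sigmoid(self.depth_gate(depth_feat))
        return modulated * self.select(modulated)  # keep saliency-related channels
```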