Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for
Robotic Grasping
- URL: http://arxiv.org/abs/2303.11228v2
- Date: Fri, 14 Jul 2023 22:23:31 GMT
- Title: Bimodal SegNet: Instance Segmentation Fusing Events and RGB Frames for
Robotic Grasping
- Authors: Sanket Kachole, Xiaoqian Huang, Fariborz Baghaei Naeini, Rajkumar
Muthusamy, Dimitrios Makris, Yahya Zweiri
- Abstract summary: We propose a Deep Learning network that fuses two types of visual signals, event-based data and RGB frame data.
The Bimodal SegNet network has two distinct encoders, one for each signal input, and spatial pyramidal pooling with atrous convolutions.
The evaluation results show a 6-10% improvement over state-of-the-art methods in terms of mean intersection over union and pixel accuracy.
- Score: 4.191965713559235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object segmentation for robotic grasping under dynamic conditions often faces
challenges such as occlusion, low light conditions, motion blur and object size
variance. To address these challenges, we propose a Deep Learning network that
fuses two types of visual signals, event-based data and RGB frame data. The
proposed Bimodal SegNet network has two distinct encoders, one for each signal
input, and spatial pyramidal pooling with atrous convolutions. The encoders
capture rich contextual information by pooling the concatenated features at
different resolutions, while the decoder recovers sharp object boundaries. The
proposed method is evaluated on the Event-based Segmentation (ESD) Dataset
under five distinct image degradation challenges: occlusion, blur, brightness,
trajectory and scale variance. The results show a 6-10% segmentation accuracy
improvement over state-of-the-art methods in terms of mean intersection over
union and pixel accuracy. The model code is
available at https://github.com/sanket0707/Bimodal-SegNet.git
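As a rough illustration of the fusion pattern described above (two modality-specific encoders whose concatenated features are pooled at multiple dilation rates by ASPP before decoding), here is a minimal PyTorch sketch. The backbone, channel sizes and two-channel event representation are placeholder assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Minimal sketch of a dual-encoder + ASPP fusion network (illustrative only).
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated 3x3 convs, then fuse."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class BimodalFusionNet(nn.Module):
    """Two encoders (RGB frames, event frames); their features are
    concatenated, pooled at several dilation rates by ASPP, then decoded."""
    def __init__(self, n_classes=2):
        super().__init__()
        def encoder(in_ch):  # placeholder backbone, 1/4-resolution features
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_enc = encoder(3)
        self.evt_enc = encoder(2)  # assumed 2-channel event frame (polarities)
        self.aspp = ASPP(128, 64)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, n_classes, 1))

    def forward(self, rgb, events):
        fused = torch.cat([self.rgb_enc(rgb), self.evt_enc(events)], dim=1)
        return self.decoder(self.aspp(fused))

net = BimodalFusionNet()
mask = net(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64))
print(mask.shape)  # torch.Size([1, 2, 64, 64])
```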
Related papers
- Spatial-information Guided Adaptive Context-aware Network for Efficient
RGB-D Semantic Segmentation [9.198120596225968]
We propose an efficient lightweight encoder-decoder network that reduces the parameter count and computational cost while maintaining the robustness of the algorithm.
Experimental results on NYUv2, SUN RGB-D, and Cityscapes datasets show that our method achieves a better trade-off among segmentation accuracy, inference time, and parameters than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-11T09:02:03Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
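The summary does not spell out the internals of the all-round attentive fusion (AF) module; as a generic stand-in, a squeeze-and-excitation style channel attention over the concatenated modality features is a common way to build such a block. The sketch below is purely illustrative, not XMSNet's actual design.

```python
# Generic attentive fusion of two modality feature maps (illustrative only).
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Global pool -> bottleneck MLP -> per-channel weights in [0, 1]
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, 2 * channels, 1), nn.Sigmoid())
        self.project = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat_a, feat_b):
        x = torch.cat([feat_a, feat_b], dim=1)  # stack both modalities
        return self.project(x * self.attn(x))   # reweight channels, then fuse

fuse = AttentiveFusion(64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```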
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Robust Double-Encoder Network for RGB-D Panoptic Segmentation [31.807572107839576]
Panoptic segmentation provides an interpretation of the scene by computing a pixelwise semantic label together with instance IDs.
We propose a novel encoder-decoder neural network that processes RGB and depth separately through two encoders.
We show that our approach achieves superior results compared to other common approaches for panoptic segmentation.
arXiv Detail & Related papers (2022-10-06T11:46:37Z)
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation.
This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Boundary-Aware Segmentation Network for Mobile and Web Applications [60.815545591314915]
Boundary-Aware Network (BASNet) combines a predict-refine architecture with a hybrid loss for highly accurate image segmentation.
BASNet runs at over 70 fps on a single GPU, which benefits many potential real-world applications.
Based on BASNet, we further developed two (close to) commercial applications: AR COPY & PASTE, in which BASNet is integrated with augmented reality for "COPYING" and "PASTING" real-world objects, and OBJECT CUT, a web-based tool for automatic object background removal.
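For context on the hybrid loss mentioned above: the BASNet paper combines binary cross-entropy, SSIM and IoU terms over the predicted map. A simplified two-term sketch (the SSIM term is omitted here for brevity) might look like the following.

```python
# Simplified hybrid segmentation loss: pixel-level BCE + region-level IoU.
# The actual BASNet loss also includes a patch-level SSIM term.
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU over probability maps; penalizes region-level mismatch."""
    inter = (pred * target).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3)) - inter
    return (1.0 - (inter + eps) / (union + eps)).mean()

def hybrid_loss(logits, target):
    pred = torch.sigmoid(logits)
    return F.binary_cross_entropy_with_logits(logits, target) \
        + iou_loss(pred, target)

logits = torch.randn(2, 1, 64, 64)                 # raw network output
target = (torch.rand(2, 1, 64, 64) > 0.5).float()  # binary ground truth
print(hybrid_loss(logits, target).item())
```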
arXiv Detail & Related papers (2021-01-12T19:20:26Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
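The summary does not detail ACMNet's symmetric gated fusion; a minimal sketch of the general gated-fusion idea, where a learned per-pixel gate blends the two modality features, is given below as an illustrative assumption.

```python
# Minimal gated fusion of two modality feature maps (illustrative only).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Gate is predicted jointly from both modalities
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        g = self.gate(torch.cat([feat_a, feat_b], dim=1))
        return g * feat_a + (1.0 - g) * feat_b  # convex per-pixel blend

fuse = GatedFusion(64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```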
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images, providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Crossmodality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
- Suppress and Balance: A Simple Gated Network for Salient Object Detection [89.88222217065858]
We propose a simple gated network (GateNet) to solve both issues at once.
With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder.
In addition, we adopt the atrous spatial pyramid pooling based on the proposed "Fold" operation (Fold-ASPP) to accurately localize salient objects of various scales.
arXiv Detail & Related papers (2020-07-16T02:00:53Z)