Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow
- URL: http://arxiv.org/abs/2403.07432v1
- Date: Tue, 12 Mar 2024 09:15:19 GMT
- Title: Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow
- Authors: Hanyu Zhou, Yi Chang, Zhiwei Shi, Luxin Yan
- Abstract summary: A single RGB camera or LiDAR is the mainstream sensor for the challenging scene flow task.
Existing methods adopt a fusion strategy to directly fuse cross-modal complementary knowledge in motion space.
We propose a novel hierarchical visual-motion fusion framework for scene flow that bridges RGB and LiDAR with the event modality.
- Score: 17.23190429955172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A single RGB camera or LiDAR is the mainstream sensor for the challenging scene flow task, which relies heavily on visual features to match motion features. Compared with a single modality, existing methods adopt a fusion strategy to directly fuse cross-modal complementary knowledge in motion space. However, these direct fusion methods may suffer from the modality gap caused by the intrinsic visual heterogeneity between RGB and LiDAR, which deteriorates the motion features. We observe that the event modality is homogeneous with both RGB and LiDAR in the visual and motion spaces. In this work, we bring in the event as a bridge between RGB and LiDAR, and propose a novel hierarchical visual-motion fusion framework for scene flow, which explores a homogeneous space in which to fuse the cross-modal complementary knowledge with a physical interpretation. In visual fusion, we find that the event is complementary to RGB in luminance space (relative vs. absolute) for high-dynamic imaging, and complementary to LiDAR in scene-structure space (local boundary vs. global shape) for structural integrity. In motion fusion, we find that RGB, event, and LiDAR are complementary to one another in correlation space (spatially dense, temporally dense vs. spatiotemporally sparse), which motivates us to fuse their motion correlations for motion continuity. The proposed hierarchical fusion explicitly fuses the multimodal knowledge to progressively improve scene flow from visual space to motion space. Extensive experiments verify the superiority of the proposed method.
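The two-stage design described in the abstract can be pictured with a minimal PyTorch-style sketch. Everything below is an illustrative assumption rather than the authors' released implementation: the module names (CrossModalFusion, HierarchicalVisualMotionFusion), the gated residual fusion, and the toy per-pixel correlation are hypothetical stand-ins, and a real scene-flow network would build 4D cost volumes over projected LiDAR features.

```python
# A minimal, hypothetical sketch of the paper's two-stage idea:
# (1) visual fusion, where event features are fused separately with RGB
# (luminance complementarity) and with LiDAR (structure complementarity),
# and (2) motion fusion, where per-modality correlations are combined.
# All names, shapes, and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses a 'bridge' (event) feature with one target modality."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, bridge: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        x = torch.cat([bridge, target], dim=1)
        return self.gate(x) * self.proj(x) + target  # gated residual fusion


class HierarchicalVisualMotionFusion(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Visual-space fusion: event bridges RGB (luminance) and LiDAR (structure).
        self.event_rgb = CrossModalFusion(dim)
        self.event_lidar = CrossModalFusion(dim)
        # Motion-space fusion: merge the three per-modality correlations.
        self.corr_fuse = nn.Conv2d(3 * dim, dim, 1)

    @staticmethod
    def correlation(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # Toy per-pixel correlation; real methods build 4D cost volumes.
        return f1 * f2

    def forward(self, rgb, event, lidar, rgb2, event2, lidar2):
        # 1) Visual fusion in a homogeneous space, with event as the bridge.
        rgb_v = self.event_rgb(event, rgb)
        lidar_v = self.event_lidar(event, lidar)
        # 2) Per-modality motion correlation between the two time steps.
        corr = torch.cat([
            self.correlation(rgb_v, rgb2),      # spatially dense
            self.correlation(event, event2),    # temporally dense
            self.correlation(lidar_v, lidar2),  # spatiotemporally sparse
        ], dim=1)
        # 3) Motion fusion toward a spatiotemporally continuous correlation.
        return self.corr_fuse(corr)


if __name__ == "__main__":
    model = HierarchicalVisualMotionFusion(dim=64)
    t = lambda: torch.randn(1, 64, 32, 32)  # dummy per-modality features
    out = model(t(), t(), t(), t(), t(), t())
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```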
Related papers
- HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection [75.406055413928]
We propose a novel prompt-driven segment anything model (HyPSAM) for RGB-T SOD. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction. Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-23T07:32:11Z) - Rethinking RGB-Event Semantic Segmentation with a Novel Bidirectional Motion-enhanced Event Representation [8.76832497215149]
Event cameras capture motion dynamics, offering a unique modality with great potential in various computer vision tasks. RGB-Event fusion faces three misalignments: (i) spatial, (ii) temporal, and (iii) modal misalignment. We propose the Motion-enhanced Event representation (MET), which transforms sparse event voxels into a dense and temporally coherent form.
arXiv Detail & Related papers (2025-05-02T19:19:58Z) - Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human activity recognition has primarily relied on traditional RGB cameras to achieve high performance.
Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras.
In this work, we rethink human activity recognition by combining RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z) - Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow [21.821959971338767]
We propose a novel common spatiotemporal fusion between the frame and event modalities for high-dynamic scene optical flow.
In motion fusion, we discover that the frame-based motion possesses spatially dense but temporally discontinuous correlation, while the event-based motion has sparse but temporally continuous correlation.
arXiv Detail & Related papers (2025-03-10T07:16:32Z) - Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation [47.75348821902489]
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time.
Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios.
This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality.
arXiv Detail & Related papers (2025-01-01T13:40:09Z) - RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation [43.358140897849616]
In this paper, we incorporate RGB images, point clouds, and events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow.
Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin.
arXiv Detail & Related papers (2023-09-26T17:23:55Z) - Attentive Multimodal Fusion for Optical and Scene Flow [24.08052492109655]
Existing methods typically rely solely on RGB images or fuse the modalities at later stages.
We propose a novel deep neural network approach named FusionRAFT, which enables early-stage information fusion between sensor modalities.
Our approach exhibits improved robustness in the presence of noise and low-lighting conditions that affect the RGB images.
arXiv Detail & Related papers (2023-07-28T04:36:07Z) - Revisiting Event-based Video Frame Interpolation [49.27404719898305]
Dynamic vision sensors, or event cameras, provide rich complementary information for video frame interpolation.
Estimating optical flow from events is arguably more difficult than from RGB information.
We propose a divide-and-conquer strategy in which event-based intermediate frame synthesis happens incrementally in multiple simplified stages.
arXiv Detail & Related papers (2023-07-24T06:51:07Z) - Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation [19.41334573257174]
Traditional methods mostly use RGB images, which are heavily affected by lighting conditions, e.g., darkness.
Recent studies show thermal images are robust to the night scenario as a compensating modality for segmentation.
This work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation.
arXiv Detail & Related papers (2023-06-17T14:28:08Z) - Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection [23.48709176879878]
Temporal action detection aims to predict the time intervals and the classes of action instances in the video.
Existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow.
We introduce a cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality.
arXiv Detail & Related papers (2023-03-30T10:47:26Z) - Does Thermal Really Always Matter for RGB-T Salient Object Detection? [153.17156598262656]
This paper proposes a network named TNet to solve the RGB-T salient object detection (SOD) task.
In this paper, we introduce a global illumination estimation module to predict the global illuminance score of the image.
On the other hand, we introduce a two-stage localization and complementation module in the decoding phase to transfer object localization cue and internal integrity cue in thermal features to the RGB modality.
arXiv Detail & Related papers (2022-10-09T13:50:12Z) - Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multi-temporal representation.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z) - End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z) - Trear: Transformer-based RGB-D Egocentric Action Recognition [38.20137500372927]
We propose a Transformer-based RGB-D egocentric action recognition framework, called Trear.
It consists of two modules, inter-frame attention encoder and mutual-attentional fusion block.
arXiv Detail & Related papers (2021-01-05T19:59:30Z) - Learning Selective Mutual Attention and Contrast for RGB-D Saliency Detection [145.4919781325014]
How to effectively fuse cross-modal information is the key problem for RGB-D salient object detection.
Many models use the feature fusion strategy but are limited by the low-order point-to-point fusion methods.
We propose a novel mutual attention model by fusing attention and contexts from different modalities.
arXiv Detail & Related papers (2020-10-12T08:50:10Z) - Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking [85.333260415532]
We develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities.
When the appearance cue is unreliable, we take motion cues into account to make the tracker robust.
Numerous results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
arXiv Detail & Related papers (2020-07-04T08:11:33Z)