A Compacted Structure for Cross-domain learning on Monocular Depth and
Flow Estimation
- URL: http://arxiv.org/abs/2208.11993v1
- Date: Thu, 25 Aug 2022 10:46:29 GMT
- Title: A Compacted Structure for Cross-domain learning on Monocular Depth and
Flow Estimation
- Authors: Yu Chen, Xu Cao, Xiaoyi Lin, Baoru Huang, Xiao-Yun Zhou, Jian-Qing
Zheng, Guang-Zhong Yang
- Abstract summary: This paper presents a multi-task scheme that achieves mutual assistance by means of Flow to Depth (F2D), Depth to Flow (D2F), and Exponential Moving Average (EMA).
A dual-head mechanism is used to predict optical flow for rigid and non-rigid motion in a divide-and-conquer manner.
Experiments on KITTI datasets show that our multi-task scheme outperforms other multi-task schemes.
- Score: 31.671655267992683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate motion and depth recovery is important for many robot vision tasks
including autonomous driving. Most previous studies have achieved cooperative
multi-task interaction via either pre-defined loss functions or cross-domain
prediction. This paper presents a multi-task scheme that achieves mutual
assistance by means of our Flow to Depth (F2D), Depth to Flow (D2F), and
Exponential Moving Average (EMA). F2D and D2F mechanisms enable multi-scale
information integration between optical flow and depth domain based on
differentiable shallow nets. A dual-head mechanism is used to predict optical
flow for rigid and non-rigid motion in a divide-and-conquer manner, which
significantly improves the optical flow estimation performance. Furthermore, to
make the prediction more robust and stable, EMA is used for our multi-task
training. Experimental results on KITTI datasets show that our multi-task
scheme outperforms other multi-task schemes and provides marked improvements in
the prediction results.
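The EMA mechanism mentioned in the abstract refers to keeping an exponentially smoothed "shadow" copy of the model parameters and evaluating with it, a common way to stabilize noisy multi-task training. A minimal sketch follows; the decay value, parameter dictionary, and update loop are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of Exponential Moving Average (EMA) weight smoothing,
# as commonly used to stabilize multi-task training. The decay value
# and parameter names here are illustrative assumptions.

def ema_update(shadow, params, decay=0.999):
    """Blend the current parameters into the EMA 'shadow' copy in place."""
    for k, v in params.items():
        shadow[k] = decay * shadow[k] + (1.0 - decay) * v
    return shadow

# Usage: keep a shadow copy alongside the trained parameters and
# evaluate with the shadow weights for more stable predictions.
params = {"w": 1.0}
shadow = dict(params)
for step in range(3):
    params["w"] += 0.5          # stand-in for an optimizer update
    ema_update(shadow, params, decay=0.9)
```

After the loop the shadow weight lags the raw weight, smoothing out per-step fluctuations; a larger decay (closer to 1) makes the shadow copy change more slowly.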
Related papers
- ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our purely temporal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - Exploring End-to-end Differentiable Neural Charged Particle Tracking -- A Loss Landscape Perspective [0.0]
We propose an E2E differentiable decision-focused learning scheme for particle tracking.
We show that differentiable variations of discrete assignment operations allow for efficient network optimization.
We argue that E2E differentiability provides, besides the general availability of gradient information, an important tool for robust particle tracking to mitigate prediction instabilities.
arXiv Detail & Related papers (2024-07-18T11:42:58Z) - StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction [22.29257945966914]
We propose a streaming and unified framework for joint 3D Multi-Object Tracking and trajectory Prediction (StreamMOTP).
We construct the model in a streaming manner and exploit a memory bank to preserve and leverage the long-term latent features for tracked objects more effectively.
We also improve the quality and consistency of predicted trajectories with a dual-stream predictor.
arXiv Detail & Related papers (2024-06-28T11:35:35Z) - Efficient Multitask Dense Predictor via Binarization [19.5100813204537]
We introduce network binarization to compress resource-intensive multi-task dense predictors.
We propose a Binary Multi-task Dense Predictor, Bi-MTDP, and several variants of Bi-MTDP.
One variant of Bi-MTDP outperforms full-precision (FP) multi-task dense prediction SoTAs, ARTC (CNN-based) and InvPT (ViT-based).
arXiv Detail & Related papers (2024-05-23T03:19:23Z) - AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection [23.91870504363899]
Double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data.
This design has hindered the widespread deployment of multispectral pedestrian detection in embedded devices for autonomous systems.
We introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network.
arXiv Detail & Related papers (2024-05-21T17:17:17Z) - Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution with its sparse architecture for effective task decoupling.
Intuition-MoR1E achieves superior efficiency and 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - FLODCAST: Flow and Depth Forecasting via Multimodal Recurrent
Architectures [31.879514593973195]
We propose a flow and depth forecasting model, trained to jointly forecast both modalities at once.
We train the proposed model to also perform predictions for several timesteps in the future.
We report benefits on the downstream task of segmentation forecasting, injecting our predictions in a flow-based mask-warping framework.
arXiv Detail & Related papers (2023-10-31T16:30:16Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a VSOD network with up and down parallel symmetry, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z) - Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of
Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called Cross-Channel Attention Module (CCAM).
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth called AffineMix, and a simple depth augmentation using predicted semantics called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantics.
arXiv Detail & Related papers (2022-06-21T17:40:55Z) - EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.