M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
- URL: http://arxiv.org/abs/2510.17363v1
- Date: Mon, 20 Oct 2025 10:03:31 GMT
- Title: M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
- Authors: U. V. B. L Udugama, George Vosselman, Francesco Nex,
- Abstract summary: Multi-Mono-Hydra (M2H) is a novel multi-task learning framework for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image.<n>Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment.
- Score: 4.329662126907974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework.<n>We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale.<n>Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z) - 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding [18.76513756741288]
Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness.<n>We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network.<n> Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations.
arXiv Detail & Related papers (2025-11-25T18:59:34Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce bfUnistage, a unified Transformer-based framework fortemporal modeling.<n>Our work demonstrates that a task-specific vision-text can build a generalizable model fortemporal learning.<n>We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space.
We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z) - A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE)
MoE achieves better accuracy and over 80% reduction computation but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Fast GraspNeXt: A Fast Self-Attention Neural Network Architecture for
Multi-task Learning in Computer Vision Tasks for Robotic Grasping on the Edge [80.88063189896718]
High architectural and computational complexity can result in poor suitability for deployment on embedded devices.
Fast GraspNeXt is a fast self-attention neural network architecture tailored for embedded multi-task learning in computer vision tasks for robotic grasping.
arXiv Detail & Related papers (2023-04-21T18:07:14Z) - Joint 2D-3D Multi-Task Learning on Cityscapes-3D: 3D Detection,
Segmentation, and Depth Estimation [11.608682595506354]
TaskPrompter presents an innovative multi-task prompting framework.
It unifies the learning of (i) task-generic representations, (ii) task-specific representations, and (iii) cross-task interactions.
New benchmark requires the multi-task model to concurrently generate predictions for monocular 3D vehicle detection, semantic segmentation, and monocular depth estimation.
arXiv Detail & Related papers (2023-04-03T13:41:35Z) - M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task
Learning with Model-Accelerator Co-design [95.41238363769892]
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly.
Current MTL regimes have to activate nearly the entire model even to just execute a single task.
We present a model-accelerator co-design framework to enable efficient on-device MTL.
arXiv Detail & Related papers (2022-10-26T15:40:24Z) - DenseMTL: Cross-task Attention Mechanism for Dense Multi-task Learning [18.745373058797714]
We propose a novel multi-task learning architecture that leverages pairwise cross-task exchange through correlation-guided attention and self-attention.
We conduct extensive experiments across three multi-task setups, showing the advantages of our approach compared to competitive baselines in both synthetic and real-world benchmarks.
arXiv Detail & Related papers (2022-06-17T17:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.