Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration
- URL: http://arxiv.org/abs/2408.09336v1
- Date: Sun, 18 Aug 2024 02:33:45 GMT
- Title: Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration
- Authors: Hao Ai, Lin Wang
- Abstract summary: 360 cameras capture the entire surrounding environment with a large FoV, providing comprehensive visual information from which 3D structure can be directly inferred.
Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored.
We propose a novel end-to-end multi-task learning framework, named Elite360M, capable of simultaneously inferring 3D structure, via depth and surface normal estimation, and semantics, via semantic segmentation.
- Score: 5.577935944665
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: 360 cameras capture the entire surrounding environment with a large FoV, providing comprehensive visual information from which 3D structure, e.g., depth and surface normals, and semantics can be inferred simultaneously. Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored. Achieving this objective is, however, challenging due to: 1) the inherent spherical distortion of the planar equirectangular projection (ERP) and the insufficient global perception induced by the 360 image's ultra-wide FoV; 2) the non-trivial challenge of effectively merging geometry and semantics across tasks so that they mutually benefit. In this paper, we propose a novel end-to-end multi-task learning framework, named Elite360M, capable of simultaneously inferring 3D structure, via depth and surface normal estimation, and semantics, via semantic segmentation. Our key idea is to build a representation with strong global perception and little distortion while exploring the inter- and cross-task relationships between geometry and semantics. We incorporate the distortion-free and spatially continuous icosahedron projection (ICOSAP) points and combine them with ERP to enhance global perception. At negligible cost, a Bi-projection Bi-attention Fusion module is designed to capture the semantic- and distance-aware dependencies between each pixel of the region-aware ERP feature and the ICOSAP point-feature set. Moreover, we propose a novel Cross-task Collaboration module that explicitly extracts task-specific geometric and semantic information from the learned representation to produce preliminary predictions, and then integrates spatial contextual information among the tasks to realize cross-task fusion. Extensive experiments demonstrate the effectiveness and efficiency of Elite360M.
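The abstract names two mechanisms without spelling out their computation: a Bi-projection Bi-attention Fusion step in which each ERP pixel feature attends to the ICOSAP point-feature set, and a Cross-task Collaboration step that exchanges spatial context among the depth, normal, and segmentation branches. Below is a minimal PyTorch sketch of how such components might be wired together; every class name, tensor shape, the use of standard multi-head cross-attention, and the concatenation-based task exchange are illustrative assumptions, not the paper's actual implementation.

    # Hypothetical sketch only: the real Elite360M modules (B2F, Cross-task
    # Collaboration) are more elaborate; this shows the general wiring.
    from typing import List
    import torch
    import torch.nn as nn

    class BiProjectionFusion(nn.Module):
        """ERP pixel features (queries) attend to distortion-free ICOSAP point features."""
        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, erp_feat: torch.Tensor, icosap_feat: torch.Tensor) -> torch.Tensor:
            # erp_feat:    (B, H*W, C) flattened equirectangular feature map
            # icosap_feat: (B, N, C)   point features sampled on the icosahedron
            fused, _ = self.cross_attn(erp_feat, icosap_feat, icosap_feat)
            return self.norm(erp_feat + fused)  # residual fusion into one representation

    class CrossTaskExchange(nn.Module):
        """Toy stand-in for cross-task collaboration: each task feature is
        refined with context pooled from all tasks' features."""
        def __init__(self, dim: int = 256, num_tasks: int = 3):
            super().__init__()
            self.mix = nn.Linear(dim * num_tasks, dim)

        def forward(self, task_feats: List[torch.Tensor]) -> List[torch.Tensor]:
            shared = torch.cat(task_feats, dim=-1)   # (B, H*W, C * num_tasks)
            context = self.mix(shared)               # project shared context back to C
            return [f + context for f in task_feats]

    if __name__ == "__main__":
        B, HW, N, C = 2, 1024, 642, 256  # 642 = vertex count of a level-3 icosahedron
        rep = BiProjectionFusion(C)(torch.randn(B, HW, C), torch.randn(B, N, C))
        depth, normal, seg = CrossTaskExchange(C, 3)([rep, rep.clone(), rep.clone()])
        print(depth.shape)  # torch.Size([2, 1024, 256])

In the paper the task branches would be decoders producing depth, normal, and segmentation maps; they are left as feature tensors here to keep the sketch short.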
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework.
We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale.
Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z) - Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning [1.5604334108839177]
Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations.
We introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework.
We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively.
Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset.
arXiv Detail & Related papers (2025-10-03T19:04:15Z) - SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment [11.586275116426442]
Simultaneous understanding and 3D reconstruction play an important role in developing end-to-end embodied intelligent systems.
We propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images.
arXiv Detail & Related papers (2025-07-03T15:15:21Z) - econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians [56.85804719947]
We propose econSG for open-vocabulary semantic segmentation with 3DGS.
Our econSG shows state-of-the-art performance on four benchmark datasets compared to the existing methods.
arXiv Detail & Related papers (2025-04-08T13:12:31Z) - GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions.
We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z) - Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z) - Focus on Neighbors and Know the Whole: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation [64.07560335451723]
CoSER is a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D.
It achieves both efficiency and quality by meticulously learning neighbor-view coherence.
It aggregates information along motion paths explicitly defined by physical principles to refine details.
arXiv Detail & Related papers (2024-08-23T15:16:01Z) - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z) - S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching
for Autonomous Driving [40.305452898732774]
S$^3$M-Net is a novel joint learning framework developed to perform semantic segmentation and stereo matching simultaneously.
S$^3$M-Net shares the features extracted from RGB images between both tasks, resulting in an improved overall scene understanding capability.
arXiv Detail & Related papers (2024-01-21T06:47:33Z) - Aligning and Prompting Everything All at Once for Universal Visual
Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z) - General-Purpose Multimodal Transformer meets Remote Sensing Semantic
Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
arXiv Detail & Related papers (2023-07-07T04:58:34Z) - Multi-Projection Fusion and Refinement Network for Salient Object
Detection in 360{\deg} Omnidirectional Image [141.10227079090419]
We propose a Multi-Projection Fusion and Refinement Network (MPFR-Net) to detect the salient objects in 360deg omnidirectional image.
MPFR-Net uses the equirectangular projection image and four corresponding cube-unfolding images as inputs.
Experimental results on two omnidirectional datasets demonstrate that the proposed approach outperforms the state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-23T14:50:40Z) - DenseMTL: Cross-task Attention Mechanism for Dense Multi-task Learning [18.745373058797714]
We propose a novel multi-task learning architecture that leverages pairwise cross-task exchange through correlation-guided attention and self-attention.
We conduct extensive experiments across three multi-task setups, showing the advantages of our approach compared to competitive baselines in both synthetic and real-world benchmarks.
arXiv Detail & Related papers (2022-06-17T17:59:45Z) - Improving Point Cloud Semantic Segmentation by Learning 3D Object
Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving.
Current 3D semantic segmentation networks focus on convolutional architectures that perform well for well-represented classes.
We propose a novel Detection-Aware 3D Semantic Segmentation (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.