TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning Framework
- URL: http://arxiv.org/abs/2407.18038v3
- Date: Tue, 10 Sep 2024 13:48:23 GMT
- Title: TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning Framework
- Authors: Guanfeng Tang, Zhiyuan Wu, Jiahang Li, Ping Zhong, Xieyuanli Chen, Huiming Lu, Rui Fan
- Abstract summary: TiCoSS is a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching.
This study introduces three novelties: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function.
- Score: 10.005854418001219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in our human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study introduces three novelties: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI and vKITTI2 datasets, along with qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function, and demonstrate the superior performance of TiCoSS compared to prior arts, with a notable increase in mIoU by over 9%. Our source code will be publicly available at mias.group/TiCoSS upon publication.
 
      
Related papers
- RL-U$^2$Net: A Dual-Branch UNet with Reinforcement Learning-Assisted Multimodal Feature Fusion for Accurate 3D Whole-Heart Segmentation [0.624829068285122]
 We propose a dual-branch U-Net architecture enhanced by reinforcement learning for feature alignment.
The model employs a dual-branch U-shaped network to process CT and MRI patches in parallel, and introduces a novel RL-XAlign module.
Experimental results on the publicly available MM-WHS 2017 dataset demonstrate that the proposed RL-U$^2$Net outperforms existing state-of-the-art methods.
 arXiv  Detail & Related papers  (2025-08-04T16:12:06Z)
- Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach [3.0868241505670198]
 We propose a novel framework called the Dynamic Contrastive Dual-Path Network (D-HAR).
The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseCDPNet branches collaboratively process multimodal sensor data.
Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction.
Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation.
 arXiv  Detail & Related papers  (2025-07-03T17:37:46Z)
- S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving [40.305452898732774]
 S$^3$M-Net is a novel joint learning framework developed to perform semantic segmentation and stereo matching simultaneously.
S$^3$M-Net shares the features extracted from RGB images between both tasks, resulting in an improved overall scene understanding capability.
 arXiv  Detail & Related papers  (2024-01-21T06:47:33Z)
- SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition [39.99711066167837]
 This paper introduces a contrastive learning framework, namely the Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, we integrate the sequences with a feature extractor to derive explicit clues from spatial and temporal domains respectively.
We conduct evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, and transfer learning.
 arXiv  Detail & Related papers  (2023-09-11T21:32:13Z)
- Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
 Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
 arXiv  Detail & Related papers  (2023-07-25T14:20:52Z)
- Motor Imagery Decoding Using Ensemble Curriculum Learning and Collaborative Training [11.157243900163376]
 Multi-subject EEG datasets present several kinds of domain shifts.
These domain shifts impede robust cross-subject generalization.
We propose a two-stage model ensemble architecture built with multiple feature extractors.
We demonstrate that our model ensembling approach combines the powers of curriculum learning and collaborative training.
 arXiv  Detail & Related papers  (2022-11-21T13:45:44Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
 We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two or more spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
 arXiv  Detail & Related papers  (2021-08-10T09:25:07Z)
- CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
 Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
 arXiv  Detail & Related papers  (2020-11-10T04:28:11Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
 We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
 arXiv  Detail & Related papers  (2020-08-21T10:45:09Z)
- Bi-Directional Attention for Joint Instance and Semantic Segmentation in Point Clouds [9.434847591440485]
 We build a Bi-Directional Attention module on backbone neural networks for 3D point cloud perception.
It uses a similarity matrix measured from the features of one task to help aggregate non-local information for the other task.
From comprehensive experiments and ablation studies on the S3DIS dataset and the PartNet dataset, the superiority of our method is verified.
 arXiv  Detail & Related papers  (2020-03-11T17:16:07Z)
- Cross-modality Person re-identification with Shared-Specific Feature Transfer [112.60513494602337]
 Cross-modality person re-identification (cm-ReID) is a challenging but key technology for intelligent video analysis.
We propose a novel cross-modality shared-specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics.
 arXiv  Detail & Related papers  (2020-02-28T00:18:45Z)
- Unpaired Multi-modal Segmentation via Knowledge Distillation [77.39798870702174]
 We propose a novel learning scheme for unpaired cross-modality image segmentation.
In our method, we heavily reuse network parameters, by sharing all convolutional kernels across CT and MRI.
We have extensively validated our approach on two multi-class segmentation problems.
 arXiv  Detail & Related papers  (2020-01-06T20:03:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     