SHED Light on Segmentation for Dense Prediction
- URL: http://arxiv.org/abs/2601.22529v1
- Date: Fri, 30 Jan 2026 04:02:29 GMT
- Title: SHED Light on Segmentation for Dense Prediction
- Authors: Seung Hyun Lee, Sangwoo Mo, Stella X. Yu
- Abstract summary: Dense prediction infers per-pixel values from a single image and is fundamental to 3D perception and robotics. We propose SHED, a novel encoder-decoder architecture that explicitly enforces a geometric prior by incorporating segmentation into dense prediction.
- Score: 36.2884048331386
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense prediction infers per-pixel values from a single image and is fundamental to 3D perception and robotics. Although real-world scenes exhibit strong structure, existing methods treat the task as independent pixel-wise prediction, often resulting in structural inconsistencies. We propose SHED, a novel encoder-decoder architecture that explicitly enforces a geometric prior by incorporating segmentation into dense prediction. Through bidirectional hierarchical reasoning, segment tokens are hierarchically pooled in the encoder and unpooled in the decoder, reversing the hierarchy. The model is supervised only at the final output, allowing the segment hierarchy to emerge without explicit segmentation supervision. SHED improves depth-boundary sharpness and segment coherence while demonstrating strong cross-domain generalization from synthetic to real-world environments. Its hierarchy-aware decoder better captures global 3D scene layouts, leading to improved semantic segmentation performance. Moreover, SHED enhances 3D reconstruction quality and reveals interpretable part-level structures that conventional pixel-wise methods often miss.
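A minimal sketch of the pooling/unpooling idea, assuming a soft-assignment scheme; the module names, shapes, and fusion rule below are illustrative guesses, not the released SHED implementation:

```python
import torch
import torch.nn as nn

class SegmentPool(nn.Module):
    """Pool N fine tokens into M segment tokens via soft assignment."""
    def __init__(self, dim, num_segments):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_segments, dim))

    def forward(self, x):                                   # x: (B, N, D)
        attn = torch.einsum('md,bnd->bmn', self.queries, x)
        assign = attn.softmax(dim=1)                        # each token splits over segments
        segments = torch.einsum('bmn,bnd->bmd', assign, x)  # (B, M, D)
        return segments, assign                             # assignments reused in the decoder

class SegmentUnpool(nn.Module):
    """Broadcast segment tokens back to the finer level with saved assignments."""
    def forward(self, segments, assign, skip):              # reverses SegmentPool
        up = torch.einsum('bmn,bmd->bnd', assign, segments)
        return up + skip                                    # fuse with encoder skip features
```

Stacking several SegmentPool stages in the encoder and mirroring them with SegmentUnpool in the decoder gives the bidirectional hierarchy the abstract describes, with only the final dense output supervised.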
Related papers
- Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion [4.679314646805623]
3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. We propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance-level to part-level.
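The summary does not spell out the Bayesian fusion rule; the sketch below shows generic multi-view label fusion under a naive-Bayes assumption, purely as an illustration:

```python
import numpy as np

def fuse_views(probs):
    # probs: (V, C) per-view class probabilities for one 3D point;
    # naive-Bayes fusion under a uniform prior: sum log-likelihoods, renormalize
    log_post = np.log(np.clip(probs, 1e-8, 1.0)).sum(axis=0)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

views = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1]])
print(fuse_views(views))  # class 0 dominates once all views are combined
```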
arXiv Detail & Related papers (2025-12-07T15:15:52Z)
- InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision [29.40602634269908]
Large-scale video-text pretraining achieves strong performance but depends on noisy synthetic captions with limited semantic coverage. Masked video modeling (MVM) directly exploits temporal structure but trails text-supervised methods on general tasks. We propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for world modeling.
arXiv Detail & Related papers (2025-12-01T06:57:39Z)
- Binary-Gaussian: Compact and Progressive Representation for 3D Gaussian Segmentation [83.90109373769614]
3D Gaussian Splatting (3D-GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation. We propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via binary-to-decimal mapping. We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks, reducing inter-class conflicts and thereby enhancing fine-grained segmentation capability.
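The binary-to-decimal mapping itself is easy to illustrate in isolation; the bit width K and the coarse-first ordering below are assumptions for the sketch:

```python
import torch

def encode_bits(bits: torch.Tensor) -> torch.Tensor:
    # bits: (N, K) tensor of 0/1 per-Gaussian binary codes -> (N,) integer codes
    K = bits.shape[1]
    weights = 2 ** torch.arange(K - 1, -1, -1, device=bits.device)
    return (bits.long() * weights).sum(dim=1)

def decode_bits(codes: torch.Tensor, K: int) -> torch.Tensor:
    # (N,) integer codes -> (N, K) binary codes, coarse bit first
    weights = 2 ** torch.arange(K - 1, -1, -1, device=codes.device)
    return (codes.unsqueeze(1) // weights) % 2

bits = torch.tensor([[1, 0, 1], [0, 1, 1]])
codes = encode_bits(bits)                       # tensor([5, 3])
assert torch.equal(decode_bits(codes, 3), bits)
```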
arXiv Detail & Related papers (2025-11-30T15:51:30Z)
- Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge [45.19482892758984]
Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity objectives to yield semantically organized representations. We further design the Cross-modal Affordance Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps.
arXiv Detail & Related papers (2025-10-09T15:01:26Z)
- Depth Edge Alignment Loss: DEALing with Depth in Weakly Supervised Semantic Segmentation [10.222017971893917]
This study proposes a model-agnostic Depth Edge Alignment Loss to improve weakly supervised semantic segmentation models across different datasets. The methodology generates pixel-level semantic labels from image-level supervision, avoiding expensive annotation processes. We demonstrate that our approach improves segmentation performance across datasets and models, and that it can be combined with other losses for further gains.
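The summary leaves the loss's exact form unspecified; the sketch below shows one plausible edge-aware formulation, where the finite-difference gradients and the smoothness threshold tau are our assumptions, not the paper's definition:

```python
import torch

def depth_edge_alignment_loss(seg_logits, depth, tau=0.05):
    # seg_logits: (B, C, H, W) predicted class scores; depth: (B, 1, H, W)
    seg = seg_logits.softmax(dim=1)
    # finite-difference gradients along x and y
    dseg_x = (seg[..., :, 1:] - seg[..., :, :-1]).abs().sum(dim=1)   # (B, H, W-1)
    dseg_y = (seg[..., 1:, :] - seg[..., :-1, :]).abs().sum(dim=1)   # (B, H-1, W)
    ddep_x = (depth[..., :, 1:] - depth[..., :, :-1]).abs().squeeze(1)
    ddep_y = (depth[..., 1:, :] - depth[..., :-1, :]).abs().squeeze(1)
    # penalize segmentation edges where the depth map is smooth,
    # pushing predicted boundaries toward depth discontinuities
    loss_x = (dseg_x * (ddep_x < tau).float()).mean()
    loss_y = (dseg_y * (ddep_y < tau).float()).mean()
    return loss_x + loss_y
```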
arXiv Detail & Related papers (2025-09-22T12:42:10Z)
- Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective [6.61305444380644]
We argue that there are fundamental connections between semantic segmentation and compression. We derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT). Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter.
arXiv Detail & Related papers (2024-11-05T12:10:02Z)
- View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields [52.08335264414515]
We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene.
Our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output.
We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency.
arXiv Detail & Related papers (2024-05-30T04:14:58Z)
- COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction [60.87168562615171]
The autonomous driving community has shown significant interest in 3D occupancy prediction.
We propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder.
COTR outperforms baselines with a relative improvement of 8%-15%.
arXiv Detail & Related papers (2023-12-04T14:23:18Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
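A generic sketch of this local-then-global attention pattern, with window size, head count, and the window-summary scheme as illustrative assumptions rather than HLG's exact design:

```python
import torch
import torch.nn as nn

def window_partition(x, w):
    # (B, H, W, D) -> (B * H//w * W//w, w*w, D); H and W must be divisible by w
    B, H, W, D = x.shape
    x = x.view(B, H // w, w, W // w, w, D).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, w * w, D)

class LocalGlobalBlock(nn.Module):
    """Local attention inside each window, then global attention across
    per-window summary tokens; a generic stand-in for the HLG pattern."""
    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.w = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, H, W, D)
        B, H, W, D = x.shape
        win = window_partition(x, self.w)                   # (B*nw, w*w, D)
        win, _ = self.local_attn(win, win, win)             # attention within windows
        nw = win.shape[0] // B
        s = win.mean(dim=1).view(B, nw, D)                  # one summary token per window
        s, _ = self.global_attn(s, s, s)                    # attention across windows
        win = win + s.reshape(B * nw, 1, D)                 # inject global context back
        x = win.view(B, H // self.w, W // self.w, self.w, self.w, D)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)
```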
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Small Lesion Segmentation in Brain MRIs with Subpixel Embedding [105.1223735549524]
We present a method to segment MRI scans of the human brain into ischemic stroke lesions and normal tissue.
We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network.
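The "spatial expansion embedding" suggests subpixel-style upsampling; the sketch below shows the generic pixel-shuffle mechanism behind that idea, with all layer sizes as illustrative assumptions rather than the paper's network:

```python
import torch.nn as nn

class SubpixelUp(nn.Module):
    """A conv predicts r*r values per location; PixelShuffle rearranges them
    into an r-times-larger map. Layer sizes are illustrative only."""
    def __init__(self, in_ch, out_ch, r=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):               # (B, in_ch, H, W) -> (B, out_ch, r*H, r*W)
        return self.shuffle(self.conv(x))
```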
arXiv Detail & Related papers (2021-09-18T00:21:17Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
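The patch-as-sequence idea SETR builds on can be sketched compactly; the bilinear head and all hyperparameters below are simplified stand-ins for SETR's decoder variants, not its published configuration:

```python
import torch
import torch.nn as nn

class PatchSeqSegmenter(nn.Module):
    """Image as a patch sequence fed to a plain transformer encoder; a linear
    head plus bilinear upsampling stands in for SETR's decoders."""
    def __init__(self, num_classes, dim=256, patch=16, img=512, depth=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (img // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):                                   # x: (B, 3, img, img)
        B, _, H, W = x.shape
        t = self.embed(x).flatten(2).transpose(1, 2)        # (B, N, dim) patch tokens
        t = self.encoder(t + self.pos)                      # global context at every layer
        f = t.transpose(1, 2).reshape(B, -1, H // self.patch, W // self.patch)
        return nn.functional.interpolate(self.head(f), size=(H, W), mode='bilinear')
```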
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.