ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics
- URL: http://arxiv.org/abs/2512.09510v1
- Date: Wed, 10 Dec 2025 10:34:43 GMT
- Title: ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics
- Authors: Donato Caramia, Florian T. Pokorny, Giuseppe Triggiani, Denis Ruffino, David Naso, Paolo Roberto Massenio,
- Abstract summary: We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation.<n>We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenario.
- Score: 6.117506436557094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We proposte two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenario. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
Related papers
- Segment Anything, Even Occluded [35.150696061791805]
SAMEO is a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder.<n>We introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets.<n>Our results demonstrate that our approach, when trained on the newly extended dataset, achieves remarkable zero-shot performance on both COCOA-cls and D2SA benchmarks.
arXiv Detail & Related papers (2025-03-08T16:14:57Z) - MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments [49.45034796115852]
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment.<n>Current datasets fall short in scale, realism and do not capture the nature of OR scenes, limiting multimodal in OR modeling.<n>We introduce MM-OR, a realistic and large-scale multimodal OR dataset, and first dataset to enable multimodal scene graph generation.
arXiv Detail & Related papers (2025-03-04T13:00:52Z) - LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion [79.22197702626542]
This paper introduces a framework that explores amodal segmentation for robotic grasping in cluttered scenes.
We propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net)
The results on different datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-06T14:50:48Z) - BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents [56.33989853438012]
We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling.<n>The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
arXiv Detail & Related papers (2024-07-08T07:26:08Z) - Hyper-Transformer for Amodal Completion [82.4118011026855]
Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information.
We introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN)
This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks.
arXiv Detail & Related papers (2024-05-30T11:11:54Z) - Amodal Ground Truth and Completion in the Wild [84.54972153436466]
We use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images.
This pipeline is used to construct an amodal completion evaluation benchmark, MP3D-Amodal, consisting of a variety of object categories and labels.
arXiv Detail & Related papers (2023-12-28T18:59:41Z) - Self-supervised Amodal Video Object Segmentation [57.929357732733926]
Amodal perception requires inferring the full shape of an object that is partially occluded.
This paper develops a new framework of amodal Video object segmentation (SaVos)
arXiv Detail & Related papers (2022-10-23T14:09:35Z) - AISFormer: Amodal Instance Segmentation with Transformer [9.042737643989561]
Amodal Instance (AIS) aims to segment the region of both visible and possible occluded parts of an object instance.
We present AISFormer, an AIS framework, with a Transformer-based mask head.
arXiv Detail & Related papers (2022-10-12T15:42:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.