Making a Case for 3D Convolutions for Object Segmentation in Videos
- URL: http://arxiv.org/abs/2008.11516v2
- Date: Fri, 1 Sep 2023 14:02:37 GMT
- Title: Making a Case for 3D Convolutions for Object Segmentation in Videos
- Authors: Sabarinath Mahadevan, Ali Athar, Aljoša Ošep, Sebastian Hennen, Laura Leal-Taixé, Bastian Leibe
- Abstract summary: We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
- Score: 16.167397418720483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of object segmentation in videos is usually accomplished by
processing appearance and motion information separately using standard 2D
convolutional networks, followed by a learned fusion of the two sources of
information. On the other hand, 3D convolutional networks have been
successfully applied for video classification tasks, but have not been
leveraged as effectively to problems involving dense per-pixel interpretation
of videos compared to their 2D convolutional counterparts and lag behind the
aforementioned networks in terms of performance. In this work, we show that 3D
CNNs can be effectively applied to dense video prediction tasks such as salient
object segmentation. We propose a simple yet effective encoder-decoder network
architecture consisting entirely of 3D convolutions that can be trained
end-to-end using a standard cross-entropy loss. To this end, we leverage an
efficient 3D encoder and propose a 3D decoder architecture that comprises
novel 3D Global Convolution layers and 3D Refinement modules. Our approach
outperforms existing state-of-the-art methods by a large margin on the DAVIS'16
Unsupervised, FBMS and ViSal benchmarks, in addition to being faster,
thus showing that our architecture can efficiently learn expressive
spatio-temporal features and produce high quality video segmentation masks. We
have made our code and trained models publicly available at
https://github.com/sabarim/3DC-Seg.
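The abstract's central claim, that a single stack of 3D convolutions can mix appearance and motion jointly instead of fusing two separate 2D streams, can be illustrated with a minimal sketch. The `conv3d` function below is a hypothetical, dependency-free illustration of how a kt×kh×kw kernel aggregates information across frames as well as across pixels; it is not the authors' 3DC-Seg implementation (see the linked repository for that).

```python
def conv3d(volume, kernel):
    """Valid-mode 3D convolution over a T x H x W volume.

    A 2D convolution applied per frame only mixes pixels within one frame;
    sliding the kernel along the time axis as well lets every output value
    aggregate spatio-temporal context from kt consecutive frames.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # slide over time (frames)
        frame = []
        for y in range(H - kh + 1):      # slide over height
            row = []
            for x in range(W - kw + 1):  # slide over width
                s = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            s += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(s)
            frame.append(row)
        out.append(frame)
    return out


# A 3-frame clip of 3x3 frames; a 2x2x2 kernel of ones sums the
# 8 spatio-temporal neighbours spanning two consecutive frames.
clip = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
result = conv3d(clip, kernel)  # shape (2, 2, 2); every entry is 8.0
```

In a real network the kernels are learned, operate over many channels with padding and strides, and are implemented with framework primitives such as PyTorch's `nn.Conv3d`; the loop above only makes the temporal-mixing behaviour explicit.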
Related papers
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration.
An online, real-time, fine-grained and highly generalizable 3D perception model is urgently needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
- CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer [42.68740105997167]
We introduce two frameworks for 3D object detection with minimal hand-crafted design.
Firstly, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder for point features within each proposal.
Secondly, we present an enhanced network called CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information.
arXiv Detail & Related papers (2024-06-12T12:40:28Z)
- DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations.
We leverage the strong semantic prior within a 3D generative model to train a semantic decoder.
Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network.
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
- Pointwise Attention-Based Atrous Convolutional Neural Networks [15.499267533387039]
A pointwise attention-based atrous convolutional neural network architecture is proposed to efficiently handle large numbers of points.
The proposed model has been evaluated on the two most important 3D point cloud datasets for the 3D semantic segmentation task.
It achieves a reasonable performance compared to state-of-the-art models in terms of accuracy, with a much smaller number of parameters.
arXiv Detail & Related papers (2019-12-27T13:12:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.