Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition
- URL: http://arxiv.org/abs/2303.01526v2
- Date: Fri, 29 Sep 2023 03:20:21 GMT
- Title: Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition
- Authors: Yiqing Liang, Eliot Laidlaw, Alexander Meyerowitz, Srinath Sridhar,
James Tompkin
- Abstract summary: We reconstruct a neural volume that captures time-varying color, density, scene flow, semantics, and attention information.
The semantics and attention let us identify salient foreground objects separately from the background across spacetime.
We show that this method can decompose dynamic scenes in an unsupervised way with performance competitive with a supervised method.
- Score: 51.67493993845143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: From video, we reconstruct a neural volume that captures time-varying color,
density, scene flow, semantics, and attention information. The semantics and
attention let us identify salient foreground objects separately from the
background across spacetime. To mitigate low-resolution semantic and attention
features, we compute pyramids that trade off detail against whole-image context. After
optimization, we perform a saliency-aware clustering to decompose the scene. To
evaluate real-world scenes, we annotate object masks in the NVIDIA Dynamic
Scene and DyCheck datasets. We demonstrate that this method can decompose
dynamic scenes in an unsupervised way with performance competitive with a
supervised method, and that it improves foreground/background segmentation over
recent static/dynamic split methods. Project Webpage:
https://visual.cs.brown.edu/saff
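A minimal sketch of the saliency-aware clustering step, assuming per-point semantic features and attention values have already been rendered from the optimized volume (array shapes, the cluster count, and the KMeans choice are illustrative, not the authors' exact procedure):
```python
import numpy as np
from sklearn.cluster import KMeans

def saliency_aware_clusters(features, attention, k=5):
    """Cluster per-point semantic features, then mark clusters whose mean
    attention exceeds the global mean attention as salient foreground.

    features:  (N, D) semantic feature per spacetime sample (assumed input)
    attention: (N,)   attention value per sample (assumed input)
    """
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    cluster_attention = np.array([attention[labels == c].mean() for c in range(k)])
    salient = cluster_attention > attention.mean()
    return labels, salient  # per-point foreground mask: salient[labels]
```
Indexing `salient[labels]` then yields a per-point foreground/background split that can be rendered into 2D masks.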
Related papers
- Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement [32.335953514942474]
This paper proposes to jointly learn the scene representation along with a 3D dense feature field and a 2D feature extractor.
We learn the underlying geometry of the scene with an implicit field through volumetric rendering and design our feature field to leverage intermediate geometric information encoded in the implicit field.
Visual localization is then achieved by aligning the image-based features and the rendered volumetric features.
arXiv Detail & Related papers (2024-06-12T17:51:53Z)
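A rough illustration of the alignment idea in the entry above; `render_features` and all inputs are assumed placeholders, not the paper's API:
```python
import torch

def refine_pose(delta_init, image_feats, render_features, steps=100, lr=1e-2):
    """Toy pose refinement: optimize an se(3) pose perturbation so rendered
    volumetric features match 2D features extracted from the image.
    render_features(delta) -> (H, W, D) is an assumed differentiable renderer."""
    delta = delta_init.clone().requires_grad_(True)  # 6-vector se(3) update
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = (render_features(delta) - image_feats).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```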
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
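Hierarchies like the one argued for above are commonly embedded in hyperbolic space; a generic Poincaré-ball distance, shown here only to make the idea concrete:
```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance in the Poincaré ball, the standard hyperbolic model
    for hierarchical embeddings (generic formula, not this paper's exact loss)."""
    uu = u.pow(2).sum(-1).clamp(max=1 - eps)
    vv = v.pow(2).sum(-1).clamp(max=1 - eps)
    duv = (u - v).pow(2).sum(-1)
    return torch.acosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))
```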
- Learning To Segment Dominant Object Motion From Watching Videos [72.57852930273256]
We envision a simple framework for dominant moving object segmentation that neither requires annotated data to train nor relies on saliency priors or pre-trained optical flow maps.
Inspired by a layered image representation, we introduce a technique to group pixel regions according to their affine parametric motion.
This enables our network to learn segmentation of the dominant foreground object using only RGB image pairs as input for both training and inference.
arXiv Detail & Related papers (2021-11-28T14:51:00Z)
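A toy version of affine parametric motion grouping for the entry above; a dense flow field is assumed as input here, whereas the paper learns the grouping from RGB pairs alone:
```python
import numpy as np

def affine_motion_residuals(flow):
    """Fit a single affine motion model to a dense flow field and return
    per-pixel residuals; pixels that fit the dominant (background) motion
    poorly can be grouped as the moving foreground.
    flow: (H, W, 2), assumed given for this sketch."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)  # (N, 3)
    U = flow.reshape(-1, 2)                                         # (N, 2)
    params, *_ = np.linalg.lstsq(A, U, rcond=None)                  # (3, 2)
    residual = np.linalg.norm(A @ params - U, axis=1)
    return residual.reshape(h, w)  # threshold for a dominant-motion mask
```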
- Weakly Supervised Learning of Rigid 3D Scene Flow [81.37165332656612]
We propose a data-driven scene flow estimation algorithm exploiting the observation that many 3D scenes can be explained by a collection of agents moving as rigid bodies.
We showcase the effectiveness and generalization capacity of our method on four different autonomous driving datasets.
arXiv Detail & Related papers (2021-02-17T18:58:02Z)
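The per-agent rigidity assumption in the entry above has a classic closed-form core, sketched here as a generic Kabsch fit (not the paper's learned pipeline):
```python
import numpy as np

def fit_rigid_motion(P, Q):
    """Least-squares rigid transform (Kabsch algorithm) mapping points P to Q.
    P, Q: (N, 3) corresponding 3D points."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t  # rigid scene flow for a point p is (R @ p + t) - p
```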
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
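A crude sketch of the box-to-mask idea in the entry above, assuming per-frame boxes and frame-to-frame optical flow are available; the paper's consistency mining is learned and considerably more robust:
```python
import numpy as np

def box_mask(box, shape):
    """Rasterize an (x0, y0, x1, y1) box into a boolean mask."""
    x0, y0, x1, y1 = box
    m = np.zeros(shape, dtype=bool)
    m[y0:y1, x0:x1] = True
    return m

def propagate_masks(boxes, flows, shape):
    """Toy box-to-mask generation: forward-warp the previous mask with optical
    flow and keep only pixels inside the current frame's box.
    boxes: per-frame (x0, y0, x1, y1); flows[t]: (H, W, 2) flow from t to t+1."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    masks = [box_mask(boxes[0], shape)]
    for box, flow in zip(boxes[1:], flows):
        nx = np.clip((xs + flow[..., 0]).round().astype(int), 0, w - 1)
        ny = np.clip((ys + flow[..., 1]).round().astype(int), 0, h - 1)
        warped = np.zeros(shape, dtype=bool)
        warped[ny[masks[-1]], nx[masks[-1]]] = True  # splat mask pixels forward
        masks.append(warped & box_mask(box, shape))
    return masks
```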
- Neural Scene Graphs for Dynamic Scenes [57.65413768984925]
We present the first neural rendering method that decomposes dynamic scenes into scene graphs.
We learn implicitly encoded scenes, combined with a jointly learned latent representation, to describe objects with a single implicit function.
arXiv Detail & Related papers (2020-11-20T12:37:10Z)
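A minimal data-structure sketch of the decomposition in the entry above (all field names are hypothetical): each dynamic object is a node holding a latent code and per-frame poses, next to a static background node.
```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ObjectNode:
    """One dynamic object: a latent code for a shared implicit function plus a
    rigid pose per frame (hypothetical fields)."""
    latent: List[float]
    poses: Dict[int, List[List[float]]] = field(default_factory=dict)  # frame -> 4x4

@dataclass
class SceneGraph:
    """Static background node plus one node per dynamic object."""
    background: ObjectNode
    objects: Dict[str, ObjectNode] = field(default_factory=dict)
```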
- Semantic Scene Completion using Local Deep Implicit Functions on LiDAR Data [4.355440821669468]
We propose a scene segmentation network based on local Deep Implicit Functions as a novel learning-based method for scene completion.
We show that this continuous representation is suitable to encode geometric and semantic properties of extensive outdoor scenes without the need for spatial discretization.
Our experiments verify that our method generates a powerful representation that can be decoded into a dense 3D description of a given scene.
arXiv Detail & Related papers (2020-11-18T07:39:13Z)
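A sketch of what a local deep implicit function can look like, assuming PyTorch and hypothetical layer sizes; the paper's actual architecture differs in detail:
```python
import torch

class LocalImplicitDecoder(torch.nn.Module):
    """Shared MLP that decodes a local latent code plus a continuous local
    coordinate into occupancy and semantic logits."""
    def __init__(self, latent_dim=32, n_classes=20):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + 3, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, 1 + n_classes),
        )

    def forward(self, latent, xyz_local):
        out = self.mlp(torch.cat([latent, xyz_local], dim=-1))
        return out[..., :1], out[..., 1:]  # occupancy logit, class logits
```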
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
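To make the mutual-information objective in the entry above concrete, a generic histogram MI estimator; DyStaB itself minimizes a differentiable formulation over motion fields:
```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of mutual information between two 1-D signals; a
    non-differentiable stand-in for the segment-wise motion MI objective
    (x and y would be quantized motion samples from two segments)."""
    hist, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = hist / hist.sum()
    px = pxy.sum(axis=1, keepdims=True)  # (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # (1, bins)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```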
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.