A topological solution to object segmentation and tracking
- URL: http://arxiv.org/abs/2107.02036v1
- Date: Mon, 5 Jul 2021 13:52:57 GMT
- Title: A topological solution to object segmentation and tracking
- Authors: Thomas Tsao and Doris Y. Tsao
- Abstract summary: Current computer vision approaches to segmentation and tracking that approach human performance all require learning.
Here, we show that the mathematical structure of light rays reflected from environment surfaces yields a natural representation of persistent surfaces.
We demonstrate that our approach can segment and invariantly track objects in cluttered synthetic video despite severe appearance changes, without requiring learning.
- Score: 0.951828574518325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The world is composed of objects, the ground, and the sky. Visual perception
of objects requires solving two fundamental challenges: segmenting visual input
into discrete units, and tracking identities of these units despite appearance
changes due to object deformation, changing perspective, and dynamic occlusion.
Current computer vision approaches to segmentation and tracking that approach
human performance all require learning, raising the question: can objects be
segmented and tracked without learning? Here, we show that the mathematical
structure of light rays reflected from environment surfaces yields a natural
representation of persistent surfaces, and this surface representation provides
a solution to both the segmentation and tracking problems. We describe how to
generate this surface representation from continuous visual input, and
demonstrate that our approach can segment and invariantly track objects in
cluttered synthetic video despite severe appearance changes, without requiring
learning.
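
The abstract does not spell out how the topological surface representation is constructed, so it cannot be coded from this digest. Purely as a loose, learning-free stand-in, the Python sketch below segments moving regions by frame differencing and carries identities across frames by mask overlap; every name in it (`segment_by_motion`, `track_by_overlap`, the thresholds) is hypothetical, and none of it reproduces the authors' method.

```python
# Learning-free segment-and-track toy (illustrative stand-in only; this is
# NOT the paper's topological surface representation).
import numpy as np
from scipy.ndimage import label  # connected-component labeling

def segment_by_motion(prev_frame, frame, thresh=0.1):
    """Segment moving regions by thresholded frame differencing.

    Frames are float arrays in [0, 1]; no parameters are learned.
    """
    moving = np.abs(frame - prev_frame) > thresh        # binary motion mask
    components, n = label(moving)                       # label connected regions
    return [components == i for i in range(1, n + 1)]   # one boolean mask per region

def track_by_overlap(prev_masks, masks, min_iou=0.3):
    """Carry identities across frames by greedy mask-overlap (IoU) matching."""
    ids = {}
    for j, m in enumerate(masks):
        best, best_iou = None, min_iou
        for i, p in enumerate(prev_masks):
            inter = np.logical_and(p, m).sum()
            union = np.logical_or(p, m).sum()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best, best_iou = i, iou
        ids[j] = best  # None marks a newly appeared segment
    return ids
```

Unlike the paper's surface representation, this toy has no notion of surface persistence, so it will lose identities under the severe appearance changes the authors handle; it only makes the "no learned parameters" setting concrete.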
Related papers
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
The object-centric voxelization infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- Seeing Objects in a Cluttered World: Computational Objectness from Motion in Video [0.0]
Perceiving the visually disjoint surfaces of our world as whole objects, physically distinct from those overlapping them, forms the basis of our visual perception.
We present a simple but novel approach to infer objectness from phenomenology without object models.
We show that it delivers robust perception of individual attended objects in cluttered scenes, even with blur and camera shake.
arXiv Detail & Related papers (2024-02-02T03:57:11Z) - Tracking through Containers and Occluders in the Wild [32.86030395660071]
We introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment.
We create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance.
We evaluate two recent transformer-based video models and find that, while they can be surprisingly capable of tracking targets under certain settings of task variation, a considerable performance gap remains before any tracking model can be said to have acquired a true notion of object permanence.
arXiv Detail & Related papers (2023-05-04T17:59:58Z)
- Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
We propose a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that it can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)
- SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos [23.64091569954785]
We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation.
By using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset (a toy masked-depth loss in this spirit is sketched after this list).
arXiv Detail & Related papers (2022-06-15T18:57:07Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos [59.12750806239545]
We observe that a video contains different views of the same scene related by moving components, and that the right region segmentation and region flow would allow mutual view synthesis.
Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images.
By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically, without building them up from low-level edges or optical flow, respectively (a toy version of this segment-flow objective is sketched after this list).
arXiv Detail & Related papers (2021-11-11T18:59:11Z)
- Object-Centric Representation Learning with Generative Spatial-Temporal Factorization [5.403549896734018]
We propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes.
We show that DyMON learns to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations.
We also show that the factorized scene representations support querying about a single object by space and time independently.
arXiv Detail & Related papers (2021-11-09T20:04:16Z)
- Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition [57.088328223220934]
Existing scene understanding systems mainly focus on recognizing the visible parts of a scene, ignoring the intact appearance of physical objects in the real world.
In this work, we propose a higher-level scene understanding system to tackle both visible and invisible parts of objects and backgrounds in a given scene.
arXiv Detail & Related papers (2021-04-12T11:37:23Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
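
Two of the mechanisms summarized above can be made concrete with small sketches. First, SAVi++'s use of sparse LiDAR depth as a training signal amounts, in toy form, to a depth loss evaluated only at pixels where a LiDAR return exists; the function below is a hypothetical simplification, not the authors' training code.

```python
# Toy masked depth loss on sparse targets (hypothetical simplification of
# SAVi++'s use of sparse LiDAR depth; not the authors' training code).
import numpy as np

def sparse_depth_loss(pred_depth, lidar_depth):
    """L2 loss computed only where a (sparse) LiDAR return exists.

    `lidar_depth` uses NaN where no return was recorded.
    """
    valid = ~np.isnan(lidar_depth)          # supervise only pixels with a return
    if not valid.any():
        return 0.0
    return float(np.mean((pred_depth[valid] - lidar_depth[valid]) ** 2))
```

Second, the segment-flow view-synthesis objective in "The Emergence of Objectness" reduces, in toy form, to warping one frame by a per-segment flow and scoring the result against the next frame. The sketch below assumes one constant integer translation per segment and plain NumPy; the paper uses a dense segment flow inside a differentiable training pipeline.

```python
# Toy segment-flow view synthesis (hedged simplification of the objective in
# "The Emergence of Objectness"; not the authors' implementation).
import numpy as np

def synthesize_view(frame, masks, flows):
    """Warp `frame` by translating each segment with its own (dy, dx) flow.

    `masks` are boolean arrays tiling the image; later segments overwrite
    earlier ones where their shifted masks overlap.
    """
    out = np.zeros_like(frame)
    for mask, (dy, dx) in zip(masks, flows):
        shifted_pix = np.roll(np.roll(frame * mask, dy, axis=0), dx, axis=1)
        shifted_mask = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
        out = np.where(shifted_mask, shifted_pix, out)
    return out

def view_synthesis_error(frame_t, frame_t1, masks, flows):
    """Photometric error the appearance and motion pathways jointly minimize."""
    return float(np.mean((synthesize_view(frame_t, masks, flows) - frame_t1) ** 2))
```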
This list is automatically generated from the titles and abstracts of the papers on this site.