Related papers: Learning Appearance and Motion Cues for Panoptic Tracking

Learning Appearance and Motion Cues for Panoptic Tracking

URL: http://arxiv.org/abs/2503.09191v1
Date: Wed, 12 Mar 2025 09:32:29 GMT
Title: Learning Appearance and Motion Cues for Panoptic Tracking
Authors: Juana Valeria Hurtado, Sajad Marvi, Rohit Mohan, Abhinav Valada,
Abstract summary: Panoptic tracking enables pixel-level scene of videos by integrating instance tracking in panoptic segmentation.<n>We propose a novel approach for panoptic tracking that simultaneously captures information and instance-specific appearance and motion features.
Score: 13.062016289815057
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Panoptic tracking enables pixel-level scene interpretation of videos by integrating instance tracking in panoptic segmentation. This provides robots with a spatio-temporal understanding of the environment, an essential attribute for their operation in dynamic environments. In this paper, we propose a novel approach for panoptic tracking that simultaneously captures general semantic information and instance-specific appearance and motion features. Unlike existing methods that overlook dynamic scene attributes, our approach leverages both appearance and motion cues through dedicated network heads. These interconnected heads employ multi-scale deformable convolutions that reason about scene motion offsets with semantic context and motion-enhanced appearance features to learn tracking embeddings. Furthermore, we introduce a novel two-step fusion module that integrates the outputs from both heads by first matching instances from the current time step with propagated instances from previous time steps and subsequently refines associations using motion-enhanced appearance embeddings, improving robustness in challenging scenarios. Extensive evaluations of our proposed \netname model on two benchmark datasets demonstrate that it achieves state-of-the-art performance in panoptic tracking accuracy, surpassing prior methods in maintaining object identities over time. To facilitate future research, we make the code available at http://panoptictracking.cs.uni-freiburg.de

Related papers

Segment Any Motion in Videos [80.72424676419755]
We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support.
arXiv Detail & Related papers (2025-03-28T09:34:11Z)
Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments [10.303368447554591]
This paper proposes a multi-task framework to simultaneously predict scene flow and instance segmentation of full-temporal point clouds. The novelty of this work is threefold: 1) developing a coarse-to-fine prediction based multitask scheme, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions to enhance the performance of scene flow estimation and instance segmentation, while can help ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which utilizes coarse
arXiv Detail & Related papers (2025-03-19T02:43:19Z)
Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects.<n>Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion.<n>Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities.<n>We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z)
Learning semantical dynamics and spatiotemporal collaboration for human pose estimation in video [3.2195139886901813]
We present a novel framework that learns multi-level semantical dynamics and multi-frame human pose estimation.<n>Specifically, we first design a multi-masked context and pose reconstruction strategy.<n>This strategy stimulates the model to explore multi-temporal semantic relationships among frames by progressively masking the features of optical (patch) cubes and frames.
arXiv Detail & Related papers (2025-02-15T00:35:34Z)
Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation [1.6584112749108326]
TCDSG, Temporally Consistent Dynamic Scene Graphs, is an end-to-end framework that detects, tracks, and links subject-object relationships across time.<n>Our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
arXiv Detail & Related papers (2024-12-03T20:19:20Z)
Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions. Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding. We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations. Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
MOPT: Multi-Object Panoptic Tracking [33.77171216778909]
We introduce a novel perception task denoted as multi-object panoptic tracking (MOPT) MOPT allows for exploiting pixel-level semantic information of 'thing' and'stuff' classes, temporal coherence, and pixel-level associations over time. We present extensive quantitative and qualitative evaluations of both vision-based and LiDAR-based MOPT that demonstrate encouraging results.
arXiv Detail & Related papers (2020-04-17T11:45:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.