Canonical Space Representation for 4D Panoptic Segmentation of Articulated Objects
- URL: http://arxiv.org/abs/2511.05356v1
- Date: Fri, 07 Nov 2025 15:47:56 GMT
- Title: Canonical Space Representation for 4D Panoptic Segmentation of Articulated Objects
- Authors: Manuel Gomes, Bogdan Raducanu, Miguel Oliveira
- Abstract summary: Articulated object perception presents significant challenges in computer vision. Most existing methods ignore temporal dynamics despite the inherently dynamic nature of such objects. We propose CanonSeg4D, a novel 4D panoptic segmentation framework.
- Score: 5.7565330936756025
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Articulated object perception presents significant challenges in computer vision, particularly because most existing methods ignore temporal dynamics despite the inherently dynamic nature of such objects. The use of 4D temporal data has not been thoroughly explored in articulated object perception and remains unexamined for panoptic segmentation. The lack of a benchmark dataset further hinders progress in this field. To this end, we introduce Artic4D as a new dataset derived from PartNet Mobility and augmented with synthetic sensor data, featuring 4D panoptic annotations and articulation parameters. Building on this dataset, we propose CanonSeg4D, a novel 4D panoptic segmentation framework. This approach explicitly estimates per-frame offsets mapping observed object parts to a learned canonical space, thereby enhancing part-level segmentation. The framework employs this canonical representation to achieve consistent alignment of object parts across sequential frames. Comprehensive experiments on Artic4D demonstrate that the proposed CanonSeg4D outperforms state-of-the-art approaches in panoptic segmentation accuracy in more complex scenarios. These findings highlight the effectiveness of temporal modeling and canonical alignment in dynamic object understanding, and pave the way for future advances in 4D articulated object perception.
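The core idea of per-frame offsets mapping observed parts into a shared canonical space can be illustrated with a minimal sketch. This is not the authors' implementation: CanonSeg4D *learns* these offsets with a network, whereas the hypothetical `canonicalize` function below simply approximates each frame's offset by its centroid displacement from the first frame, which suffices only for rigid translations.

```python
import numpy as np

def canonicalize(frames):
    """Map each frame's part points into a shared canonical space.

    Illustrative stand-in for a learned offset: here the per-frame
    offset is the centroid displacement from the first frame.
    `frames` is a list of (N, 3) point arrays of the same part.
    """
    reference = frames[0].mean(axis=0)          # canonical anchor
    aligned = []
    for pts in frames:
        offset = pts.mean(axis=0) - reference   # per-frame offset estimate
        aligned.append(pts - offset)            # move points into canonical space
    return aligned

# Two frames of the same part, the second shifted by a rigid translation.
f0 = np.random.default_rng(0).normal(size=(100, 3))
f1 = f0 + np.array([0.5, -0.2, 0.1])
a0, a1 = canonicalize([f0, f1])
print(np.allclose(a0, a1))  # prints True: frames coincide after alignment
```

Once all frames live in one canonical space, part-level segmentation and tracking reduce to reasoning over a single aligned representation rather than per-frame geometry.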
Related papers
- Articulated 3D Scene Graphs for Open-World Mobile Manipulation [55.97942733699124]
We present MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes. We estimate articulation models using a novel unified twist estimation formulation. We also introduce the novel Arti4D-Semantic dataset.
arXiv Detail & Related papers (2026-02-18T10:40:35Z) - CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction [40.557276644446475]
We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos.
arXiv Detail & Related papers (2025-12-12T19:11:11Z) - Inferring Compositional 4D Scenes without Ever Seeing One [58.81854043690171]
We propose a method that consistently predicts the structure and spatio-temporal configuration of 4D/3D objects. We achieve this through carefully designed training of several spatial and temporal attentions on 2D video input. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and composed scenes.
arXiv Detail & Related papers (2025-12-04T21:51:47Z) - C4D: 4D Made from 3D through Dual Correspondences [77.04731692213663]
We introduce C4D, a framework that leverages temporal correspondences to extend existing 3D reconstruction formulations to 4D. C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information.
arXiv Detail & Related papers (2025-10-16T17:59:06Z) - 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [115.67081491747943]
Dynamic 3D scene representation and novel view synthesis are crucial for enabling AR/VR and metaverse applications. We reformulate the reconstruction of a time-varying 3D scene as approximating its underlying 4D volume. We derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck.
arXiv Detail & Related papers (2024-12-30T05:30:26Z) - 4DRecons: 4D Neural Implicit Deformable Objects Reconstruction from a single RGB-D Camera with Geometrical and Topological Regularizations [35.161541396566705]
4DRecons encodes the output as a 4D neural implicit surface.
We show that 4DRecons can handle large deformations and complex inter-part interactions.
arXiv Detail & Related papers (2024-06-14T16:38:00Z) - 4D Panoptic Scene Graph Generation [102.22082008976228]
We introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding.
Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations.
We propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs.
arXiv Detail & Related papers (2024-05-16T17:56:55Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories.
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - LoRD: Local 4D Implicit Representation for High-Fidelity Dynamic Human Modeling [69.56581851211841]
We propose a novel Local 4D implicit Representation for Dynamic clothed human, named LoRD.
Our key insight is to encourage the network to learn the latent codes of local part-level representation.
LoRD has a strong capability for representing 4D humans, and outperforms state-of-the-art methods on practical applications.
arXiv Detail & Related papers (2022-08-18T03:49:44Z) - Neural Part Priors: Learning to Optimize Part-Based Object Completion in RGB-D Scans [27.377128012679076]
We propose to leverage large-scale synthetic datasets of 3D shapes annotated with part information to learn Neural Part Priors.
We can optimize over the learned part priors in order to fit to real-world scanned 3D scenes at test time.
Experiments on the ScanNet dataset demonstrate that NPPs significantly outperform the state of the art in part decomposition and object completion.
arXiv Detail & Related papers (2022-03-17T15:05:44Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.