DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action
Segmentation
- URL: http://arxiv.org/abs/2307.16803v1
- Date: Mon, 31 Jul 2023 16:14:24 GMT
- Title: DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action
Segmentation
- Authors: Yue Zhang and Hehe Fan and Yi Yang and Mohan Kankanhalli
- Abstract summary: We present our findings from research conducted on the Human-Object Interaction 4D (HOI4D) dataset for the egocentric action segmentation task.
We convert point cloud videos into depth videos and employ traditional video modeling methods to improve 4D action segmentation.
The proposed method achieved first place in the 4D Action Segmentation Track of the HOI4D Challenge 2023.
- Score: 39.806610397357986
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this technical report, we present our findings from research
conducted on the Human-Object Interaction 4D (HOI4D) dataset for the egocentric
action segmentation task. As a relatively new research area, point cloud video
methods may struggle with temporal modeling, especially for long point cloud
videos (e.g., 150 frames). In contrast, traditional video understanding methods
are well developed, and their effectiveness in temporal modeling has been
widely verified on many large-scale video datasets. We therefore convert point
cloud videos into depth videos and employ traditional video modeling methods to
improve 4D action segmentation. Ensembling the depth and point cloud video
methods significantly improves accuracy. The proposed method, named Mixture of
Depth and Point cloud video experts (DPMix), achieved first place in the 4D
Action Segmentation Track of the HOI4D Challenge 2023.
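The pipeline is simple to sketch. Below is a minimal illustration, not the authors' released code: it projects one point cloud frame to a depth image under an assumed pinhole camera model, then fuses per-frame class probabilities from a hypothetical depth-video expert and point-cloud expert by weighted averaging. The function names, camera intrinsics, and equal-weight fusion are all assumptions made here for illustration.

```python
import numpy as np

def point_cloud_to_depth(points, fx, fy, cx, cy, h, w):
    """Project an (N, 3) camera-frame point cloud to an (h, w) depth map.

    Pinhole model: u = fx * x / z + cx, v = fy * y / z + cy.
    When several points land on one pixel, keep the nearest (z-buffering).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    valid = z > 0                       # drop points behind the camera
    u = np.round(fx * x[valid] / z[valid] + cx).astype(int)
    v = np.round(fy * y[valid] / z[valid] + cy).astype(int)
    z = z[valid]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], z[inside]

    depth = np.full((h, w), np.inf)
    # np.minimum.at resolves pixel collisions by keeping the smallest depth.
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0        # pixels no point projected onto
    return depth

def ensemble_frame_probs(depth_probs, point_probs, w_depth=0.5):
    """Late fusion of per-frame class probabilities from the two experts.

    depth_probs, point_probs: (T, num_classes) arrays. The equal weighting
    is an assumption; the report does not specify fusion weights here.
    """
    probs = w_depth * depth_probs + (1.0 - w_depth) * point_probs
    return probs.argmax(axis=1)         # per-frame action labels
```

Repeating the projection for every frame turns a point cloud video into an ordinary depth video that off-the-shelf video backbones can consume; the averaging step mirrors the mixture-of-experts ensembling described in the abstract.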
Related papers
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [48.87063562819018]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction.
Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning.
Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z) - Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model [52.0192865857058]
We propose the first training-free 4D video generation method that leverages off-the-shelf video diffusion models to generate multi-view videos from a single input video.
Because it requires no training and fully utilizes an off-the-shelf video diffusion model, the method offers a practical and effective solution for multi-view video generation.
arXiv Detail & Related papers (2025-03-28T17:14:48Z) - Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich spatio-temporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos.
Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z) - GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking [38.104532522698285]
Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos.
Inspired by Monocular Dynamic Novel View Synthesis (MDVS), we bring pseudo 4D Gaussian fields to video generation.
We finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed GS-DiT.
arXiv Detail & Related papers (2025-01-05T23:55:33Z) - Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Video [64.38566659338751]
We propose the first 4D Gaussian Splatting framework to reconstruct a high-quality 4D model from blurry monocular video, named Deblur4DGS.
We introduce exposure regularization to avoid trivial solutions, as well as multi-frame and multi-resolution consistency regularization to alleviate artifacts. Beyond novel-view synthesis, Deblur4DGS can be applied to improve blurry video from multiple perspectives, including deblurring, frame synthesis, and video stabilization.
arXiv Detail & Related papers (2024-12-09T12:02:11Z) - Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency, evaluated across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z) - DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos [21.93514516437402]
We present DreamScene4D, the first approach to generate 3D dynamic scenes of multiple objects from monocular videos via novel view synthesis.
Our key insight is a "decompose-recompose" approach that factorizes the video scene into the background and object tracks.
We show extensive results on challenging DAVIS, Kubric, and self-captured videos with quantitative comparisons and a user preference study.
arXiv Detail & Related papers (2024-05-03T17:55:34Z) - X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos
through Cross-modal Knowledge Transfer [28.719098240737605]
We propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer.
It enhances 4D scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining.
Experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks.
arXiv Detail & Related papers (2023-12-12T15:48:12Z) - Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from
a Single Image [59.18564636990079]
We study the problem of synthesizing a long-term dynamic video from only a single image.
Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories.
We present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image.
arXiv Detail & Related papers (2023-08-20T12:53:50Z) - Masked Spatio-Temporal Structure Prediction for Self-supervised Learning
on Point Cloud Videos [75.9251839023226]
We propose a Masked Spatio-Temporal Structure Prediction (MaST-Pre) method to capture the structure of point cloud videos without human annotations.
MaST-Pre consists of two self-supervised learning tasks. First, by reconstructing masked point tubes, our method is able to capture the appearance information of point cloud videos.
Second, to learn motion, we propose a temporal cardinality difference prediction task that estimates the change in the number of points within a point tube (a minimal sketch of this counting target appears after this list).
arXiv Detail & Related papers (2023-08-18T02:12:54Z) - NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also construct a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z) - Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud
Sequence Representation Learning [14.033085586047799]
This paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation.
Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework.
Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks.
arXiv Detail & Related papers (2022-12-10T16:26:19Z) - Learning Fine-Grained Motion Embedding for Landscape Animation [140.57889994591494]
We propose a model named FGLA to generate high-quality and realistic videos by learning fine-grained motion embeddings.
To train and evaluate on diverse time-lapse videos, we build the largest high-resolution time-lapse video dataset with diverse scenes.
Our method achieves relative improvements of 19% on LPIPS and 5.6% on FVD compared with state-of-the-art methods on our dataset.
arXiv Detail & Related papers (2021-09-06T02:47:11Z)