Diffusion Action Segmentation
- URL: http://arxiv.org/abs/2303.17959v2
- Date: Sat, 12 Aug 2023 02:13:51 GMT
- Title: Diffusion Action Segmentation
- Authors: Daochang Liu, Qiyue Li, AnhDung Dinh, Tingting Jiang, Mubarak Shah,
Chang Xu
- Abstract summary: We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
- Score: 63.061058214427085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action segmentation is crucial for understanding long-form videos.
Previous works on this task commonly adopt an iterative refinement paradigm by
using multi-stage models. We propose a novel framework via denoising diffusion
models, which nonetheless shares the same inherent spirit of such iterative
refinement. In this framework, action predictions are iteratively generated
from random noise with input video features as conditions. To enhance the
modeling of three striking characteristics of human actions, including the
position prior, the boundary ambiguity, and the relational dependency, we
devise a unified masking strategy for the conditioning inputs in our framework.
Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and
Breakfast, are performed and the proposed method achieves superior or
comparable results to state-of-the-art methods, showing the effectiveness of a
generative approach for action segmentation.
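The iterative generation described in the abstract (action predictions refined step by step from random noise, conditioned on video features) can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the paper's model: `proj` stands in for a learned denoising network, and the linear interpolation schedule is hypothetical.

```python
import numpy as np

def diffusion_segment(video_feats, proj, num_steps=8, seed=0):
    """Sketch of diffusion-style action segmentation: per-frame action
    logits start as pure noise and are iteratively refined, conditioned
    on video features. `proj` is a hypothetical stand-in for a learned
    denoiser; a real model would predict the denoised logits at each step."""
    rng = np.random.default_rng(seed)
    num_frames, num_classes = video_feats.shape[0], proj.shape[1]
    y = rng.normal(size=(num_frames, num_classes))   # start from random noise
    cond = video_feats @ proj                        # feature-conditioned logits
    for t in reversed(range(num_steps)):             # iterative refinement
        alpha = (num_steps - t) / num_steps          # trust the condition more as t -> 0
        y = (1 - alpha) * y + alpha * cond           # one toy denoising step
    return y.argmax(axis=1)                          # per-frame action labels
```

With an identity projection the final step reproduces the conditioning logits exactly, so the toy loop recovers the per-frame argmax of the features; the point is only the shape of the noise-to-prediction loop, not the denoiser itself.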
Related papers
- Discrete Modeling via Boundary Conditional Diffusion Processes [29.95155303262501]
Previous approaches have suffered from the discrepancy between discrete data and continuous modeling.
We propose a two-step forward process that first estimates the boundary as a prior distribution.
We then rescale the forward trajectory to construct a boundary conditional diffusion model.
arXiv Detail & Related papers (2024-10-29T09:42:42Z)
- Language-free Compositional Action Generation via Decoupling Refinement [67.50452446686725]
We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
arXiv Detail & Related papers (2023-07-07T12:00:38Z)
- Leveraging triplet loss for unsupervised action segmentation [0.0]
We propose a fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself.
Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions.
Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches.
arXiv Detail & Related papers (2023-04-13T11:10:16Z)
- Turning to a Teacher for Timestamp Supervised Temporal Action Segmentation [27.735478880660164]
We propose a new framework for timestamp supervised temporal action segmentation.
We introduce a teacher model parallel to the segmentation model to help stabilize the process of model optimization.
Our method outperforms the state-of-the-art method and performs comparably against the fully-supervised methods at a much lower annotation cost.
arXiv Detail & Related papers (2022-07-02T02:00:55Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
- Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)
- Parameter Decoupling Strategy for Semi-supervised 3D Left Atrium Segmentation [0.0]
We present a novel semi-supervised segmentation model based on parameter decoupling strategy to encourage consistent predictions from diverse views.
Our method has achieved a competitive result over the state-of-the-art semisupervised methods on the Atrial Challenge dataset.
arXiv Detail & Related papers (2021-09-20T14:51:42Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
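The multi-stage refinement paradigm that MS-TCN++ exemplifies, where each stage refines the previous stage's per-frame prediction, can be illustrated with a toy stage loop. The matrix-valued stages below are hypothetical stand-ins for the paper's temporal convolutional stages; only the refinement structure is faithful.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax over per-frame class logits."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multi_stage_refine(frame_logits, stage_weights):
    """Toy multi-stage refinement: the first stage produces an initial
    per-frame prediction, and each later stage maps the previous stage's
    class probabilities to refined logits via a (hypothetical) learned
    weight matrix, echoing the stage-on-prediction design of MS-TCN++."""
    probs = softmax(frame_logits)        # stage 1: initial prediction
    for W in stage_weights:              # later stages refine it
        probs = softmax(probs @ W)       # each stage consumes the previous output
    return probs.argmax(axis=1)          # final per-frame action labels
```

Because softmax is monotone per row, identity stages leave the per-frame argmax unchanged; in the real architecture the stages instead learn to smooth over-segmentation errors in the earlier predictions.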
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.