Pose-Aware Weakly-Supervised Action Segmentation
- URL: http://arxiv.org/abs/2504.05700v1
- Date: Tue, 08 Apr 2025 05:42:55 GMT
- Title: Pose-Aware Weakly-Supervised Action Segmentation
- Authors: Seth Z. Zhao, Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, Behzad Dariush
- Abstract summary: We introduce a weakly-supervised framework that incorporates pose knowledge during training while omitting its use during inference. We propose a pose-inspired contrastive loss, as part of the framework, that is trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms previous state-of-the-art (SOTA) in segmenting long instructional videos.
- Score: 11.154829751558006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised framework that uniquely incorporates pose knowledge during training while omitting its use during inference, thereby distilling pose knowledge pertinent to each action component. We propose a pose-inspired contrastive loss as part of the weakly-supervised framework, trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms the previous state-of-the-art (SOTA) in segmenting long instructional videos under both online and offline settings. Additionally, we demonstrate the framework's adaptability to various segmentation backbones and pose extractors across different datasets.
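The abstract does not give the loss formulation, but a minimal sketch of what a pose-inspired contrastive loss for action boundaries might look like follows. The InfoNCE-style formulation, function names, and tensor shapes are assumptions for illustration, not the authors' exact method:

```python
import torch
import torch.nn.functional as F

def pose_contrastive_loss(frame_feats, pose_feats, boundaries, temperature=0.1):
    """Hypothetical pose-inspired contrastive loss (illustrative only).

    frame_feats: (T, D) visual features from the segmentation backbone.
    pose_feats:  (T, D) features from a pose extractor (training only).
    boundaries:  (T,) bool tensor marking estimated action boundaries.

    Pulls each frame's visual feature toward its own pose feature
    (positive) and pushes it away from pose features of other action
    segments (negatives), sharpening the representation around boundaries.
    """
    z_v = F.normalize(frame_feats, dim=-1)
    z_p = F.normalize(pose_feats, dim=-1)

    # Similarity of every visual frame to every pose frame.
    sim = z_v @ z_p.t() / temperature                   # (T, T)

    # Segment ids: frames between two boundaries share an id.
    seg_id = torch.cumsum(boundaries.long(), dim=0)     # (T,)
    same_segment = seg_id.unsqueeze(0) == seg_id.unsqueeze(1)

    # Keep the diagonal as the positive; drop same-segment off-diagonal
    # entries so negatives come only from other segments.
    T = sim.size(0)
    eye = torch.eye(T, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(same_segment & ~eye, float('-inf'))

    targets = torch.arange(T, device=sim.device)
    return F.cross_entropy(sim, targets)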
Related papers
- The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset that includes a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z)
- Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning [63.63516124646916]
We propose a deeply unified framework for depth-aware panoptic segmentation.
We propose a bi-directional guidance learning approach to facilitate cross-task feature learning.
Our method sets the new state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS and SemKITTI-DVPS datasets.
arXiv Detail & Related papers (2023-07-27T11:28:33Z)
- SeMAIL: Eliminating Distractors in Visual Imitation via Separated Models [22.472167814814448]
We propose a new model-based imitation learning algorithm named Separated Model-based Adversarial Imitation Learning (SeMAIL).
Our method achieves near-expert performance on various visual control tasks with complex observations and the more challenging tasks with different backgrounds from expert observations.
arXiv Detail & Related papers (2023-06-19T04:33:44Z)
- Accelerating exploration and representation learning with offline pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
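As a rough illustration (not this paper's exact objective), noise-contrastive state-representation learning can be sketched as an InfoNCE loss that treats temporally adjacent states as positives; all names and shapes here are assumptions:

```python
import torch
import torch.nn.functional as F

def nce_state_representation_loss(states, next_states, encoder, temperature=0.1):
    """Hypothetical InfoNCE sketch for state-representation learning.

    states, next_states: batches of consecutive observations.
    encoder: maps a batch of observations to (B, D) embeddings.

    Each state's positive is its own successor; successors of the other
    states in the batch act as noise (negative) samples.
    """
    z = F.normalize(encoder(states), dim=-1)          # (B, D)
    z_next = F.normalize(encoder(next_states), dim=-1)

    logits = z @ z_next.t() / temperature             # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)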
arXiv Detail & Related papers (2023-03-31T18:03:30Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos [20.619236432228625]
We present a framework to segment streaming videos online at test time using Dynamic Programming.
We improve our framework by introducing the Online-Offline Discrepancy Loss (OODL) to encourage higher temporal consistency in the segmentation results.
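The exact OODL formulation is not given in this summary; a minimal sketch of one plausible reading, a consistency term that pulls online predictions toward the offline pass (names and the KL form are assumptions), follows:

```python
import torch.nn.functional as F

def online_offline_discrepancy_loss(online_logits, offline_logits):
    """Hypothetical OODL sketch (illustrative, not the paper's exact loss).

    online_logits:  (T, C) per-frame class logits from the causal/online pass.
    offline_logits: (T, C) logits from a full-sequence (offline) pass.

    Penalizes disagreement between the two passes so online predictions
    stay temporally consistent with the offline segmentation.
    """
    log_p_online = F.log_softmax(online_logits, dim=-1)
    p_offline = F.softmax(offline_logits.detach(), dim=-1)  # offline acts as teacher
    return F.kl_div(log_p_online, p_offline, reduction='batchmean')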
arXiv Detail & Related papers (2022-03-24T19:27:56Z)
- Adversarial Motion Modelling helps Semi-supervised Hand Pose Estimation [116.07661813869196]
We propose to combine ideas from adversarial training and motion modelling to tap into unlabeled videos.
We show that an adversarial formulation leads to better properties of the hand pose estimator via semi-supervised training on unlabeled video sequences.
The main advantage of our approach is that we can make use of unpaired videos and joint sequence data, both of which are much easier to attain than paired training data.
arXiv Detail & Related papers (2021-06-10T17:50:19Z)
- Unsupervised Co-part Segmentation through Assembly [42.874278526843305]
We propose an unsupervised learning approach for co-part segmentation from images.
We leverage motion information embedded in videos and explicitly extract latent representations to segment meaningful object parts.
We show that our approach can achieve meaningful and compact part segmentation, outperforming state-of-the-art approaches on diverse benchmarks.
arXiv Detail & Related papers (2021-06-10T16:22:53Z)
- Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
- Learning to Segment Actions from Observation and Narration [56.99443314542545]
We apply a generative segmental model of task structure, guided by narration, to action segmentation in video.
We focus on unsupervised and weakly-supervised settings where no action labels are known during training.
arXiv Detail & Related papers (2020-05-07T18:03:57Z)