Related papers: On Evaluating Weakly Supervised Action Segmentation Methods

URL: http://arxiv.org/abs/2005.09743v3
Date: Thu, 21 Oct 2021 17:16:34 GMT
Title: On Evaluating Weakly Supervised Action Segmentation Methods
Authors: Yaser Souri, Alexander Richard, Luca Minciullo, Juergen Gall
Abstract summary: We focus on two aspects of the use and evaluation of weakly supervised action segmentation approaches. We train each method on the Breakfast dataset 5 times and provide average and standard deviation of the results. Our experiments show that the standard deviation over these repetitions is between 1 and 2.5% and significantly affects the comparison between different approaches.
Score: 79.42955857919497
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Action segmentation is the task of temporally segmenting every frame of an untrimmed video. Weakly supervised approaches to action segmentation, especially from transcripts, have been of considerable interest to the computer vision community. In this work, we focus on two aspects of the use and evaluation of weakly supervised action segmentation approaches that are often overlooked: the performance variance over multiple training runs and the impact of selecting feature extractors for this task. To tackle the first problem, we train each method on the Breakfast dataset 5 times and provide average and standard deviation of the results. Our experiments show that the standard deviation over these repetitions is between 1 and 2.5% and significantly affects the comparison between different approaches. Furthermore, our investigation into feature extraction shows that, for the studied weakly-supervised action segmentation methods, higher-level I3D features perform worse than classical IDT features.
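
The reporting protocol described in the abstract (several training runs per method, summarized by mean and standard deviation) can be sketched as below. This is a minimal illustration, not the authors' code: the function names, the use of the sample standard deviation, the rule of comparing the gap between means to the larger standard deviation, and all accuracy values are assumptions made for the example, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    return mean(scores), stdev(scores)


def gap_exceeds_spread(scores_a, scores_b):
    """Return True when the gap between the two means is larger than
    the larger of the two standard deviations.

    This is a rough heuristic chosen for illustration, not a
    statistical test and not a procedure taken from the paper.
    """
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) > max(std_a, std_b)


# Hypothetical placeholder accuracies (percent) for 5 runs each.
# These are NOT numbers reported in the paper.
method_a = [47.0, 49.0, 48.0, 50.0, 46.0]
method_b = [48.5, 47.5, 49.5, 48.0, 49.0]

if __name__ == "__main__":
    for name, runs in (("A", method_a), ("B", method_b)):
        avg, std = summarize_runs(runs)
        print(f"method {name}: {avg:.2f} +/- {std:.2f}")
    print("gap exceeds spread:", gap_exceeds_spread(method_a, method_b))
```

With placeholder values like these, the two means differ by less than the run-to-run spread, which is the kind of situation where a single-run comparison could rank the methods either way.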

Related papers

Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos [0.1053373860696675]
We present the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention. Our proposed approach incorporates a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks. Our boundary voting mechanism accurately identifies start and end points by leveraging contextual information.
arXiv Detail & Related papers (2025-04-26T01:07:56Z)
Cost-Sensitive Learning for Long-Tailed Temporal Action Segmentation [35.710284726160815]
Temporal action segmentation in procedural videos aims to densely label frames into action classes. We identify a bi-level learning bias: a class-level bias, stemming from class imbalance favoring head classes, and a transition-level bias arising from variations in transitions. As a remedy, we introduce a constrained optimization problem to alleviate both biases.
arXiv Detail & Related papers (2025-03-24T05:37:12Z)
Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
Leveraging triplet loss for unsupervised action segmentation [0.0]
We propose a fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions. Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches.
arXiv Detail & Related papers (2023-04-13T11:10:16Z)
Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization. Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting. Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification. We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned. Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications. We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class. Although progress has been made in coarse-grained actions, existing few-shot recognition methods encounter two issues handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label. In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
Delving into 3D Action Anticipation from Streaming Videos [99.0155538452263]
Action anticipation aims to recognize the action with a partial observation. We introduce several complementary evaluation metrics and present a basic model based on frame-wise action classification. We also explore multi-task learning strategies by incorporating auxiliary information from two aspects: the full action representation and the class-agnostic action label.
arXiv Detail & Related papers (2019-06-15T10:30:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.