A Closer Look at Temporal Ordering in the Segmentation of Instructional
Videos
- URL: http://arxiv.org/abs/2209.15501v1
- Date: Fri, 30 Sep 2022 14:44:19 GMT
- Title: A Closer Look at Temporal Ordering in the Segmentation of Instructional
Videos
- Authors: Anil Batra, Shreyank Gowda, Laura Sevilla-Lara, Frank Keller
- Abstract summary: We take a closer look at Procedure Segmentation and Summarization (PSS) and propose three fundamental improvements over current methods.
We propose a new segmentation metric based on dynamic programming that takes into account the order of segments.
We propose a matching algorithm that constrains the temporal order of segment mapping, and is also differentiable.
- Score: 17.712793578388126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the steps required to perform a task is an important skill for
AI systems. Learning these steps from instructional videos involves two
subproblems: (i) identifying the temporal boundary of sequentially occurring
segments and (ii) summarizing these steps in natural language. We refer to this
task as Procedure Segmentation and Summarization (PSS). In this paper, we take
a closer look at PSS and propose three fundamental improvements over current
methods. The segmentation task is critical, as generating a correct summary
requires the step to be identified first. However, current segmentation metrics
often overestimate the segmentation quality because they do not incorporate the
temporal order of segments. We propose a new segmentation metric based on
dynamic programming that takes into account the order of segments. Current PSS
methods are typically trained by proposing segments, matching them with the
ground truth and computing a loss. However, much like segmentation metrics,
existing matching algorithms do not consider the temporal order of the mapping
between candidate segments and the ground truth. We propose a matching
algorithm that constrains the temporal order of segment mapping, and is also
differentiable. Lastly, we introduce multi-modal feature training for PSS,
which further improves segmentation. We evaluate our approach on two
instructional video datasets (YouCook2 and Tasty) and improve the state of the
art by a margin of $\sim7\%$ and $\sim2.5\%$ for procedure segmentation and
summarization, respectively.
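To illustrate why ignoring temporal order can overestimate segmentation quality, here is a minimal sketch of an order-preserving matching score computed with dynamic programming. It is not the metric defined in the paper: the function names, the use of temporal IoU as the per-pair score, and the assumption that "order" means the index at which a model emits each predicted segment are all illustrative choices made for this example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end), e.g. in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def ordered_match_score(pred: List[Segment], gt: List[Segment]) -> float:
    """Mean IoU per ground-truth segment under the best one-to-one matching
    that preserves sequence order (hypothetical illustration, not the paper's metric)."""
    n, m = len(pred), len(gt)
    if m == 0:
        return 0.0
    # best[i][j]: max total IoU using the first i predictions and first j ground truths
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],  # leave prediction i unmatched
                best[i][j - 1],  # leave ground truth j unmatched
                best[i - 1][j - 1] + temporal_iou(pred[i - 1], gt[j - 1]),  # match the pair
            )
    return best[n][m] / m


def unordered_score(pred: List[Segment], gt: List[Segment]) -> float:
    """Order-agnostic baseline: each ground truth takes its best-overlapping prediction."""
    if not pred or not gt:
        return 0.0
    return sum(max(temporal_iou(p, g) for p in pred) for g in gt) / len(gt)


# Two ground-truth steps, and a model that emits the same intervals in swapped order.
gt_segments = [(0, 10), (10, 20)]
pred_segments = [(10, 20), (0, 10)]
print(unordered_score(pred_segments, gt_segments))      # order ignored
print(ordered_match_score(pred_segments, gt_segments))  # order enforced
```

In this toy case the order-agnostic score treats the prediction as perfect, while the order-preserving score can credit only one of the two segments, which is the kind of gap the abstract describes.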
Related papers
- Image Segmentation in Foundation Model Era: A Survey [99.19456390358211]
Current research in image segmentation lacks a detailed analysis of distinct characteristics, challenges, and solutions associated with these advancements.
This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation.
An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts.
arXiv Detail & Related papers (2024-08-23T10:07:59Z)
- Online Action Representation using Change Detection and Symbolic Programming [0.3937354192623676]
The proposed method employs a change detection algorithm to automatically segment action sequences.
We show the effectiveness of this representation in the downstream task of class repetition detection.
The results of the experiments demonstrate that, despite operating online, the proposed method performs better than or on par with the existing method.
arXiv Detail & Related papers (2024-05-19T10:31:59Z)
- Temporal Segment Transformer for Action Segmentation [54.25103250496069]
We propose an attention-based approach, which we call temporal segment transformer, for joint segment relation modeling and denoising.
The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments.
We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
arXiv Detail & Related papers (2023-02-25T13:05:57Z)
- A Survey on Label-efficient Deep Segmentation: Bridging the Gap between
Weak Supervision and Dense Prediction [115.9169213834476]
This paper offers a comprehensive review on label-efficient segmentation methods.
We first develop a taxonomy to organize these methods according to the supervision provided by different types of weak labels.
Next, we summarize the existing label-efficient segmentation methods from a unified perspective.
arXiv Detail & Related papers (2022-07-04T06:21:01Z)
- Action parsing using context features [0.0]
We argue that context information, particularly the temporal information about other actions in the video sequence, is valuable for action segmentation.
The proposed parsing algorithm temporally segments the video sequence into action segments.
arXiv Detail & Related papers (2022-05-20T07:54:04Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- Few-Shot Action Recognition with Compromised Metric via Optimal
Transport [31.834843714684343]
Few-shot action recognition is still not mature despite the wide research of few-shot image classification.
One main obstacle to applying these algorithms in action recognition is the complex structure of videos.
We propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions.
arXiv Detail & Related papers (2021-04-08T12:42:05Z)
- Fusing RGBD Tracking and Segmentation Tree Sampling for Multi-Hypothesis
Volumetric Segmentation [6.853379171946806]
Multihypothesis Segmentation Tracking (MST) is a novel method for volumetric segmentation in changing scenes.
Two main innovations allow us to tackle this difficult problem.
We evaluate our method on several cluttered tabletop environments in simulation and reality.
arXiv Detail & Related papers (2021-04-01T02:17:18Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action
Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- STEP: Segmenting and Tracking Every Pixel [107.23184053133636]
We present a new benchmark: Segmenting and Tracking Every Pixel (STEP).
Our work is the first that targets this task in a real-world setting that requires dense interpretation in both spatial and temporal domains.
To measure performance, we propose a novel evaluation metric, Segmentation and Tracking Quality (STQ).
arXiv Detail & Related papers (2021-02-23T18:43:02Z)