When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
- URL: http://arxiv.org/abs/2202.08138v1
- Date: Wed, 16 Feb 2022 15:26:12 GMT
- Title: When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
- Authors: Oana Ignat, Santiago Castro, Yuhang Zhou, Jiajun Bao, Dandan Shan
- Abstract summary: We consider the task of temporal human action localization in lifestyle vlogs.
We introduce a novel dataset consisting of manual annotations of temporal localization for 13,000 narrated actions in 1,200 video clips.
We propose a simple yet effective method to localize the narrated actions based on their expected duration.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the task of temporal human action localization in lifestyle
vlogs. We introduce a novel dataset consisting of manual annotations of
temporal localization for 13,000 narrated actions in 1,200 video clips. We
present an extensive analysis of this data, which allows us to better
understand how the language and visual modalities interact throughout the
videos. We propose a simple yet effective method to localize the narrated
actions based on their expected duration. Through several experiments and
analyses, we show that our method provides information complementary to
previous approaches and improves over prior work on the task of temporal
action localization.
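As a concrete illustration of the duration-informed idea, here is a minimal sketch: given per-frame relevance scores for a narrated action, slide a window whose length matches the action's expected duration and keep the best-scoring one. This is not the authors' released code; the per-frame scores, the `expected_duration_s` prior, and the 1 fps frame rate are all assumptions made for the example.

```python
import numpy as np

def localize_by_duration(frame_scores: np.ndarray,
                         expected_duration_s: float,
                         fps: float = 1.0) -> tuple[int, int]:
    """Pick the window whose length matches the action's expected duration.

    frame_scores: per-frame relevance of the narrated action to the video
        (e.g., text-video embedding similarities); shape (num_frames,).
    expected_duration_s: duration prior for this action, in seconds.
    Returns (start, end) frame indices of the best window, end exclusive.
    """
    win = max(1, int(round(expected_duration_s * fps)))
    win = min(win, len(frame_scores))
    # Mean score of every length-`win` window via a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(frame_scores)))
    window_means = (csum[win:] - csum[:-win]) / win
    start = int(np.argmax(window_means))
    return start, start + win

# Toy usage: 20 frames whose relevance peaks around frame 10.
scores = np.exp(-0.5 * ((np.arange(20) - 10) / 2.0) ** 2)
print(localize_by_duration(scores, expected_duration_s=5.0))  # (8, 13)
```

The sliding-window reduction is only one simple way to act on a duration prior; any per-frame relevance signal can be plugged in as `frame_scores`.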
Related papers
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [117.23208392452693]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM substantially surpasses state-of-the-art methods on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural-language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm (see the sketch after this list).
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset of sports videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it.
Our study shows that a sports activity usually consists of multiple sub-actions and that awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
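The DORi entry above mentions a language-conditioned message-passing algorithm over a temporal graph. The sketch below shows the general idea only, not DORi's actual architecture: frames pass features to their temporal neighbors, gated by each frame's relevance to the query. The chain-graph topology, sigmoid gating, and two update steps are simplifications chosen for illustration.

```python
import numpy as np

def language_conditioned_message_passing(frames: np.ndarray,
                                         query: np.ndarray,
                                         steps: int = 2) -> np.ndarray:
    """Generic message passing on a temporal chain, gated by a query.

    frames: per-frame features, shape (num_frames, dim).
    query:  embedding of the natural-language query, shape (dim,).
    """
    x = frames.copy()
    for _ in range(steps):
        # How relevant each frame is to the query, squashed to (0, 1).
        gate = 1.0 / (1.0 + np.exp(-(x @ query)))        # (num_frames,)
        gated = x * gate[:, None]
        prev_msg = np.roll(gated, 1, axis=0)   # message from frame i-1
        next_msg = np.roll(gated, -1, axis=0)  # message from frame i+1
        prev_msg[0] = 0.0    # first frame has no predecessor
        next_msg[-1] = 0.0   # last frame has no successor
        x = x + 0.5 * (prev_msg + next_msg)    # aggregate and update
        x /= np.linalg.norm(x, axis=1, keepdims=True) + 1e-8
    return x

# Toy usage: 8 frames of 4-d features and a random query embedding.
rng = np.random.default_rng(0)
out = language_conditioned_message_passing(rng.normal(size=(8, 4)),
                                           rng.normal(size=4))
print(out.shape)  # (8, 4)
```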
This list is automatically generated from the titles and abstracts of the papers on this site.