When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
- URL: http://arxiv.org/abs/2202.08138v1
- Date: Wed, 16 Feb 2022 15:26:12 GMT
- Title: When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
- Authors: Oana Ignat, Santiago Castro, Yuhang Zhou, Jiajun Bao, Dandan Shan
- Abstract summary: We consider the task of temporal human action localization in lifestyle vlogs.
We introduce a novel dataset consisting of manual annotations of temporal localization for 13,000 narrated actions in 1,200 video clips.
We propose a simple yet effective method to localize the narrated actions based on their expected duration.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the task of temporal human action localization in lifestyle
vlogs. We introduce a novel dataset consisting of manual annotations of
temporal localization for 13,000 narrated actions in 1,200 video clips. We
present an extensive analysis of this data, which allows us to better
understand how the language and visual modalities interact throughout the
videos. We propose a simple yet effective method to localize the narrated
actions based on their expected duration. Through several experiments and
analyses, we show that our method provides information complementary to
previous approaches and improves over prior work on the task of temporal
action localization.
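As a concrete illustration of the duration-informed idea, here is a minimal sketch: given per-frame relevance scores for a narrated action, slide a window whose length matches the action's expected duration and keep the best-scoring one. This is not the authors' released code; the per-frame scores, the `expected_duration_s` prior, and the 1 fps frame rate are all assumptions made for the example.

```python
import numpy as np

def localize_by_duration(frame_scores: np.ndarray,
                         expected_duration_s: float,
                         fps: float = 1.0) -> tuple[int, int]:
    """Pick the window whose length matches the action's expected duration.

    frame_scores: per-frame relevance of the narrated action to the video
        (e.g., text-video embedding similarities); shape (num_frames,).
    expected_duration_s: duration prior for this action, in seconds.
    Returns (start, end) frame indices of the best window, end exclusive.
    """
    win = max(1, int(round(expected_duration_s * fps)))
    win = min(win, len(frame_scores))
    # Mean score of every length-`win` window via a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(frame_scores)))
    window_means = (csum[win:] - csum[:-win]) / win
    start = int(np.argmax(window_means))
    return start, start + win

# Toy usage: 20 frames whose relevance peaks around frame 10.
scores = np.exp(-0.5 * ((np.arange(20) - 10) / 2.0) ** 2)
print(localize_by_duration(scores, expected_duration_s=5.0))  # (8, 13)
```

The sliding-window reduction is only one simple way to act on a duration prior; any per-frame relevance signal can be plugged in as `frame_scores`.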
Related papers
- Learning to Ground Instructional Articles in Videos through Narrations [50.3463147014498]
We present an approach for localizing steps of procedural activities in narrated how-to videos.
We source the step descriptions from a language knowledge base (wikiHow) containing instructional articles.
Our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities.
arXiv Detail & Related papers (2023-06-06T15:45:53Z)
- What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [117.23208392452693]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM substantially surpasses state-of-the-art methods on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural-language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm (see the sketch after this list).
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset of sports videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it.
Our study shows that a sports activity usually consists of multiple sub-actions and that awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
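The DORi entry above mentions a language-conditioned message-passing algorithm over a temporal graph. The sketch below shows the general idea only, not DORi's actual architecture: frames pass features to their temporal neighbors, gated by each frame's relevance to the query. The chain-graph topology, sigmoid gating, and two update steps are simplifications chosen for illustration.

```python
import numpy as np

def language_conditioned_message_passing(frames: np.ndarray,
                                         query: np.ndarray,
                                         steps: int = 2) -> np.ndarray:
    """Generic message passing on a temporal chain, gated by a query.

    frames: per-frame features, shape (num_frames, dim).
    query:  embedding of the natural-language query, shape (dim,).
    """
    x = frames.copy()
    for _ in range(steps):
        # How relevant each frame is to the query, squashed to (0, 1).
        gate = 1.0 / (1.0 + np.exp(-(x @ query)))        # (num_frames,)
        gated = x * gate[:, None]
        prev_msg = np.roll(gated, 1, axis=0)   # message from frame i-1
        next_msg = np.roll(gated, -1, axis=0)  # message from frame i+1
        prev_msg[0] = 0.0    # first frame has no predecessor
        next_msg[-1] = 0.0   # last frame has no successor
        x = x + 0.5 * (prev_msg + next_msg)    # aggregate and update
        x /= np.linalg.norm(x, axis=1, keepdims=True) + 1e-8
    return x

# Toy usage: 8 frames of 4-d features and a random query embedding.
rng = np.random.default_rng(0)
out = language_conditioned_message_passing(rng.normal(size=(8, 4)),
                                           rng.normal(size=4))
print(out.shape)  # (8, 4)
```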
This list is automatically generated from the titles and abstracts of the papers on this site.