Related papers: Retrieving and Highlighting Action with Spatiotemporal Reference

Retrieving and Highlighting Action with Spatiotemporal Reference

URL: http://arxiv.org/abs/2005.09183v1
Date: Tue, 19 May 2020 03:12:31 GMT
Title: Retrieving and Highlighting Action with Spatiotemporal Reference
Authors: Seito Kasai, Yuchi Ishikawa, Masaki Hayashi, Yoshimitsu Aoki, Kensho Hara, Hirokatsu Kataoka
Abstract summary: We present a framework that jointly retrieves andtemporally highlights actions in videos. Our work takes on the novel task of highlighting action highlighting, which visualizes where and when actions occur in an un video setting.
Score: 15.283548146322971
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods. Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting. Action highlighting is a fine-grained task, compared to conventional action recognition tasks which focus on classification or window-based localization. Leveraging weak supervision from annotated captions, our framework acquires spatiotemporal relevance maps and generates local embeddings which relate to the nouns and verbs in captions. Through experiments, we show that our model generates various maps conditioned on different actions, in which conventional visual reasoning methods only go as far as to show a single deterministic saliency map. Also, our model improves retrieval recall over our baseline without alignment by 2-3% on the MSR-VTT dataset.

Related papers

Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization [22.58434223222062]
We propose a new few-shot temporal action localization method by Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets.
arXiv Detail & Related papers (2025-04-18T04:35:35Z)
Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks. Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism. Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans. This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations. In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
spatial-temporal grounding describes the task of localizing events in space and time. Models for this task are usually trained with human-annotated sentences and bounding box supervision. We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z)
Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
Knowledge Prompting for Few-shot Action Recognition [20.973999078271483]
We propose a simple yet effective method, called knowledge prompting, to prompt a powerful vision-language model for few-shot classification. We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base. We feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves the state-of-the-art performance while reducing the training overhead to 0.001 of existing methods.
arXiv Detail & Related papers (2022-11-22T06:05:17Z)
Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text-descriptions. A pairwise ranking objective is used for training this embedding space which allows similar images, topics and captions in the shared semantic space. The experimental results based on MSCOCO dataset shows the competitiveness of our approach.
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion. One recent work builds an action-click supervision framework. It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision. We introduce a framework that learns two feature subspaces respectively for actions and their context. The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.