Open-Vocabulary Action Localization with Iterative Visual Prompting
- URL: http://arxiv.org/abs/2408.17422v4
- Date: Thu, 10 Oct 2024 07:22:48 GMT
- Title: Open-Vocabulary Action Localization with Iterative Visual Prompting
- Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
- Abstract summary: Video action localization aims to find the timings of specific actions from a long video.
This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models.
- Score: 8.07285448283823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, asking a VLM to guess the frame it considers closest to the start or end of the action. Iterating this process while narrowing the sampling time window results in finding the specific frames corresponding to the start and end of an action. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results illustrate a practical extension of VLMs for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.
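The abstract describes a concrete loop: sample frames within the current time window, tile them into a single image with index labels, ask the VLM which labeled frame looks closest to the action boundary, then shrink the window around that answer and repeat. Below is a minimal sketch of that loop; the helper names, tile count, iteration count, and prompt wording are illustrative assumptions rather than the authors' released implementation (see the project page linked above for the official sample code).

```python
# Minimal sketch of iterative visual prompting for action localization.
# Assumptions: frames are decoded RGB arrays; query_vlm() is a placeholder for
# any off-the-shelf multimodal model call that returns the chosen tile number.
import numpy as np
from PIL import Image, ImageDraw


def tile_with_labels(frames):
    """Concatenate frames horizontally and draw a 1-based index on each tile."""
    ims = [Image.fromarray(f) for f in frames]
    w, h = ims[0].size
    canvas = Image.new("RGB", (w * len(ims), h))
    draw = ImageDraw.Draw(canvas)
    for i, im in enumerate(ims):
        canvas.paste(im, (i * w, 0))
        draw.text((i * w + 5, 5), str(i + 1), fill="red")
    return canvas


def query_vlm(image, prompt):
    """Placeholder: send the tiled image and prompt to a VLM, return its answer text."""
    raise NotImplementedError("plug in your VLM client here")


def localize_boundary(frames, fps, action, boundary="start",
                      num_tiles=8, iterations=4):
    """Iteratively narrow the sampling window to the frame nearest an action boundary."""
    lo, hi = 0, len(frames) - 1
    for _ in range(iterations):
        idxs = np.linspace(lo, hi, num_tiles, dtype=int)      # sample inside the window
        tiled = tile_with_labels([frames[i] for i in idxs])   # one image with index labels
        prompt = (f"The image shows {num_tiles} video frames labeled 1-{num_tiles}. "
                  f"Which frame is closest to the {boundary} of the action "
                  f"'{action}'? Answer with a single number.")
        choice = int(query_vlm(tiled, prompt)) - 1             # VLM picks a labeled tile
        lo = int(idxs[max(choice - 1, 0)])                     # shrink window around it
        hi = int(idxs[min(choice + 1, num_tiles - 1)])
    return ((lo + hi) // 2) / fps                              # boundary time in seconds

# Hypothetical usage: run once per boundary to get the action interval.
# start_t = localize_boundary(frames, fps, "person opens the fridge", "start")
# end_t   = localize_boundary(frames, fps, "person opens the fridge", "end")
```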
Related papers
- Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition [84.31749632725929]
In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method.
Our key idea is to separate video representations from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains.
arXiv Detail & Related papers (2024-03-03T16:48:16Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a sketch of this idea follows this entry).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
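As a loose illustration of the CLIP-score-guided sampling mentioned in the VaQuitA summary above, the sketch below ranks candidate frames by their CLIP similarity to the text query and keeps the top-scoring ones; the model checkpoint, frame count, and selection rule are assumptions for illustration, not VaQuitA's actual pipeline.

```python
# Sketch: pick the frames most relevant to a text query by CLIP similarity,
# instead of sampling uniformly. Illustrative only, not VaQuitA's implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def top_k_frames(frames, query, k=8):
    """Return indices (in chronological order) of the k frames scoring highest for `query`."""
    images = [Image.fromarray(f) for f in frames]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per frame
    return torch.topk(scores, k=min(k, len(frames))).indices.sort().values.tolist()
```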
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
The InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events (a generic matching sketch follows this entry).
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
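The InsOVER step summarized above pairs decomposed linguistic instructions with video events through Hungarian matching. The sketch below is only a generic illustration of that matching step using scipy's linear_sum_assignment with an assumed cosine-distance cost; it does not reproduce VidCoM's decomposition or cost design.

```python
# Generic Hungarian-matching sketch: assign each instruction clause to the
# video event with globally minimal total cost. The cosine-distance cost is an
# assumption for illustration, not VidCoM's InsOVER formulation.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_clauses_to_events(clause_embs, event_embs):
    """clause_embs: (m, d), event_embs: (n, d); returns a list of (clause_idx, event_idx) pairs."""
    a = clause_embs / np.linalg.norm(clause_embs, axis=1, keepdims=True)
    b = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # cosine distance as matching cost
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))
```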
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN), for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning requires a neural model to learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively select and skip frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)