Storyboard guided Alignment for Fine-grained Video Action Recognition
- URL: http://arxiv.org/abs/2410.14238v1
- Date: Fri, 18 Oct 2024 07:40:41 GMT
- Title: Storyboard guided Alignment for Fine-grained Video Action Recognition
- Authors: Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, Liu Liu
- Abstract summary: Fine-grained video action recognition can be conceptualized as a video-text matching problem.
We propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics.
- Score: 32.02631248389487
- License:
- Abstract: Fine-grained video action recognition can be conceptualized as a video-text matching problem. Previous approaches often rely on global video semantics to consolidate video embeddings, which can lead to misalignment in video-text pairs due to a lack of understanding of action semantics at an atomic granularity level. To tackle this challenge, we propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics. Inspired by the concept of storyboarding, which disassembles a script into individual shots, we enhance global video semantics by generating fine-grained descriptions using a pre-trained large language model. These detailed descriptions capture common atomic actions depicted in videos. A filtering metric is proposed to select the descriptions that correspond to the atomic actions present in both the videos and the descriptions. By employing global semantics and fine-grained descriptions, we can identify key frames in videos and utilize them to aggregate embeddings, thereby making the embedding more accurate. Extensive experiments on various video action recognition datasets demonstrate superior performance of our proposed method in supervised, few-shot, and zero-shot settings.
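The abstract describes the pipeline only at a high level, so the following is a minimal sketch of how global semantics and the filtered fine-grained descriptions could drive soft key-frame selection and embedding aggregation. It assumes CLIP-style frame and text encoders have already produced the embeddings; the function name, the max-over-texts scoring, and the softmax temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def aggregate_video_embedding(frame_embs, global_text_emb, description_embs, tau=0.07):
    """Aggregate per-frame embeddings into a single video embedding.

    frame_embs:       (T, D) per-frame features from a visual encoder
    global_text_emb:  (D,)   embedding of the global action label/caption
    description_embs: (K, D) embeddings of LLM-generated fine-grained
                             (storyboard-like) descriptions that passed filtering
    """
    frame_embs = F.normalize(frame_embs, dim=-1)
    texts = F.normalize(
        torch.cat([global_text_emb.unsqueeze(0), description_embs], dim=0), dim=-1
    )                                                 # (1+K, D)

    # Similarity of every frame to the global semantics and to each description.
    sim = frame_embs @ texts.t()                      # (T, 1+K)

    # A frame counts as "key" if it matches either the global label
    # or any of the atomic-action descriptions.
    frame_scores = sim.max(dim=1).values              # (T,)
    weights = F.softmax(frame_scores / tau, dim=0)    # soft key-frame selection

    video_emb = (weights.unsqueeze(1) * frame_embs).sum(dim=0)
    return F.normalize(video_emb, dim=-1)
```

Classification then reduces to comparing the aggregated video embedding against the text embeddings of candidate action labels.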
Related papers
- RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates impressive versatility in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present Shot2Story20K, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show that generating long and comprehensive video summaries remains challenging.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition [8.18503795495178]
We prioritize the refinement of text knowledge to facilitate generalizable video recognition.
To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to expand them into spatio-temporal descriptors.
Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
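OST's full method additionally solves an optimal assignment between frames and descriptors; the sketch below shows only a simplified descriptor-averaging variant, with hypothetical names, to illustrate how LLM-refined text can stand in for bare category names in zero-shot scoring.

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(video_emb, class_descriptor_embs):
    """Score a video against classes represented by LLM-generated descriptors.

    video_emb:             (D,)  pooled video embedding
    class_descriptor_embs: dict mapping class name -> (K, D) tensor of
                           embeddings of K descriptor sentences for that class
    """
    v = F.normalize(video_emb, dim=-1)
    scores = {}
    for name, descs in class_descriptor_embs.items():
        d = F.normalize(descs, dim=-1)           # (K, D)
        scores[name] = (d @ v).mean().item()     # average descriptor similarity
    return scores
```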
arXiv Detail & Related papers (2023-11-30T13:32:43Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
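A minimal sketch, not the authors' code, of how such manipulated captions can be plugged in as extra hard negatives alongside the usual in-batch negatives in an InfoNCE-style objective; the tensor shapes and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(video_embs, pos_text_embs, hard_neg_embs, tau=0.07):
    """InfoNCE-style loss where manipulated event captions act as hard negatives.

    video_embs:    (B, D)    video embeddings
    pos_text_embs: (B, D)    matching (positive) caption embeddings
    hard_neg_embs: (B, M, D) M manipulated captions per video
    """
    v = F.normalize(video_embs, dim=-1)
    p = F.normalize(pos_text_embs, dim=-1)
    n = F.normalize(hard_neg_embs, dim=-1)

    pos_logits = (v * p).sum(-1, keepdim=True) / tau       # (B, 1)
    in_batch = (v @ p.t()) / tau                            # (B, B) all captions in batch
    hard = torch.einsum('bd,bmd->bm', v, n) / tau           # (B, M) manipulated captions

    # Column 0 holds the positive pair; everything else is treated as a negative.
    mask = ~torch.eye(len(v), dtype=torch.bool, device=v.device)
    logits = torch.cat([pos_logits, in_batch[mask].view(len(v), -1), hard], dim=1)
    labels = torch.zeros(len(v), dtype=torch.long, device=v.device)
    return F.cross_entropy(logits, labels)
```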
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
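The paper trains a model to predict moment boundaries; the snippet below is only a heuristic illustration of the moment-retrieval task itself, locating a start/end span by thresholding frame-text similarity. The threshold and the longest-run decoding are assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def retrieve_moment(frame_embs, text_emb, threshold=0.3):
    """Return (start, end) frame indices of the moment best matching a text query.

    frame_embs: (T, D) per-frame features; text_emb: (D,) query embedding.
    Picks the longest contiguous run of frames whose similarity to the query
    exceeds `threshold`; falls back to the single best frame if none qualify.
    """
    sim = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)   # (T,)
    above = (sim > threshold).tolist()

    best = (0, -1, -1)                               # (run length, start, end)
    cur_start = None
    for t, flag in enumerate(above + [False]):       # sentinel flushes the last run
        if flag and cur_start is None:
            cur_start = t
        elif not flag and cur_start is not None:
            if t - cur_start > best[0]:
                best = (t - cur_start, cur_start, t - 1)
            cur_start = None

    if best[0] == 0:                                 # no frame above threshold
        k = int(sim.argmax())
        return k, k
    return best[1], best[2]
```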
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Br-Prompt achieves state-of-the-art on multiple benchmarks.
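The exact prompt templates are specific to the paper; the sketch below uses a made-up template purely to illustrate how ordinal wording can fold a sequence of individual action labels into one integrated supervision prompt for a text encoder.

```python
ORDINAL = ["first", "second", "third", "fourth", "fifth",
           "sixth", "seventh", "eighth", "ninth", "tenth"]

def integrated_prompt(action_labels):
    """Compose one text prompt from an ordered list of atomic action labels."""
    parts = [f"This video contains {len(action_labels)} actions."]
    for i, label in enumerate(action_labels):
        # Fall back to a numeric ordinal beyond the tenth action.
        word = ORDINAL[i] if i < len(ORDINAL) else f"{i + 1}th"
        parts.append(f"The {word} action is {label}.")
    return " ".join(parts)
```

For example, integrated_prompt(["crack the egg", "whisk the egg"]) yields "This video contains 2 actions. The first action is crack the egg. The second action is whisk the egg."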
arXiv Detail & Related papers (2022-03-26T15:52:27Z)
- Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
- Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video.
arXiv Detail & Related papers (2021-03-09T08:17:17Z)