Related papers: OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

URL: http://arxiv.org/abs/2312.00096v2
Date: Thu, 28 Mar 2024 08:25:27 GMT
Title: OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition
Authors: Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen,
Abstract summary: We prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
Score: 8.18503795495178
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy for web-scaled descriptive narratives and concise action category names, leading to less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors with different video instances, we propose Optimal Descriptor Solver, forming the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.

Related papers

Less is More: Label-Guided Summarization of Procedural and Instructional Videos [21.13311741987469]
We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis.<n>We analyze adaptive visual sampling, label-driven anchoring, and contextual validation using a large language model (LLM)<n>Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.
arXiv Detail & Related papers (2026-01-18T03:41:48Z)
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models [9.896951371033229]
VideoPerceiver is a video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding.<n>We construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames.<n> Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks.
arXiv Detail & Related papers (2025-11-24T06:57:26Z)
Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs) Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Muti-modal Large Language Model (MLLM) Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
Open Vocabulary Multi-Label Video Classification [52.40942805153809]
We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task.<n>We propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes.
arXiv Detail & Related papers (2024-07-12T07:53:54Z)
Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian recognition (PAR) algorithms are mainly developed based on a static image. We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos. We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. We propose a framework which combines pre-trained discrimi VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos. We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge. We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training. To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification [26.12591949900602]
We formulate a text-based task conditioner to adapt video features to the few-shot learning task. Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
arXiv Detail & Related papers (2021-06-21T15:08:08Z)
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.