Videoprompter: an ensemble of foundational models for zero-shot video
understanding
- URL: http://arxiv.org/abs/2310.15324v1
- Date: Mon, 23 Oct 2023 19:45:46 GMT
- Title: Videoprompter: an ensemble of foundational models for zero-shot video
understanding
- Authors: Adeel Yousaf, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan,
Mubarak Shah
- Abstract summary: Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
- Score: 113.92958148574228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) classify the query video by calculating a
similarity score between the visual features and text-based class label
representations. Recently, large language models (LLMs) have been used to
enrich the text-based class labels by enhancing the descriptiveness of the
class names. However, these improvements are restricted to the text-based
classifier only, and the query visual features are not considered. In this
paper, we propose a framework which combines pre-trained discriminative VLMs
with pre-trained generative video-to-text and text-to-text models. We introduce
two key modifications to the standard zero-shot setting. First, we propose
language-guided visual feature enhancement and employ a video-to-text model to
convert the query video to its descriptive form. The resulting descriptions
contain vital visual cues of the query video, such as what objects are present
and their spatio-temporal interactions. These descriptive cues provide
additional semantic knowledge to VLMs to enhance their zero-shot performance.
Second, we propose video-specific prompts to LLMs to generate more meaningful
descriptions to enrich class label representations. Specifically, we introduce
prompt techniques to create a Tree Hierarchy of Categories for class names,
offering a higher-level action context for additional visual cues. We
demonstrate the effectiveness of our approach in video understanding across
three different zero-shot settings: 1) video action recognition, 2)
video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks.
Consistent improvements across multiple benchmarks and with various VLMs
demonstrate the effectiveness of our proposed framework. Our code will be made
publicly available.
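The scoring pipeline described in the abstract, fusing the query video's visual feature with the embedding of its generated description and then comparing against LLM-enriched class descriptions, can be sketched roughly as follows. This is a minimal illustration with random stand-in embeddings and hypothetical helper names (`l2_normalize`, `zero_shot_scores`, the fusion weight `alpha`), not the authors' released code; in practice the embeddings would come from a VLM's video and text encoders, the video-to-text model, and the LLM-generated class descriptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_scores(video_emb, video_caption_emb, class_desc_embs, alpha=0.5):
    """Score a query video against classes represented by averaged description embeddings.

    video_emb:         (d,)  visual feature from the VLM's video encoder
    video_caption_emb: (d,)  text embedding of the generated video description
    class_desc_embs:   dict  class name -> (k, d) embeddings of k LLM descriptions
    """
    # Fuse the visual feature with its generated description (language-guided enhancement).
    query = l2_normalize(alpha * l2_normalize(video_emb)
                         + (1.0 - alpha) * l2_normalize(video_caption_emb))
    scores = {}
    for name, descs in class_desc_embs.items():
        # Enriched class representation: average over its LLM-generated descriptions.
        class_rep = l2_normalize(l2_normalize(descs).mean(axis=0))
        scores[name] = float(query @ class_rep)  # cosine similarity
    return scores

# Toy usage with random stand-in embeddings (real ones come from the encoders).
rng = np.random.default_rng(0)
d = 512
video_emb = rng.normal(size=d)
caption_emb = rng.normal(size=d)
classes = {c: rng.normal(size=(3, d)) for c in ["archery", "bowling", "surfing"]}
scores = zero_shot_scores(video_emb, caption_emb, classes)
print(max(scores.items(), key=lambda kv: kv[1]))
```

Averaging several generated descriptions per class (including higher-level categories from the tree hierarchy) plays the role of the enriched text classifier, while the fusion weight controls how much the generated video description contributes to the query representation.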
Related papers
- Open Vocabulary Multi-Label Video Classification [45.722133656740446]
We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task.
We propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes.
arXiv Detail & Related papers (2024-07-12T07:53:54Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model [10.742625681420279]
Based on a large-scale pre-trained Vision Language Model (VLM), our model learns and improves visual, textual, and numerical representations of patient gait videos.
Results demonstrate that our model not only significantly outperforms state-of-the-art methods in video-based classification tasks but also adeptly decodes the learned class-specific text features into natural language descriptions.
arXiv Detail & Related papers (2024-03-20T17:03:38Z) - LLMs as Visual Explainers: Advancing Image Classification with Evolving
Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z) - I2MVFormer: Large Language Model Generated Multi-View Document
Supervision for Zero-Shot Image Classification [108.83932812826521]
Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks.
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views.
I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
arXiv Detail & Related papers (2022-12-05T14:11:36Z) - Reading-strategy Inspired Visual Representation Learning for
Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)