TNT: Text-Conditioned Network with Transductive Inference for Few-Shot
  Video Classification
- URL: http://arxiv.org/abs/2106.11173v1
- Date: Mon, 21 Jun 2021 15:08:08 GMT
- Title: TNT: Text-Conditioned Network with Transductive Inference for Few-Shot
  Video Classification
- Authors: Andrés Villa, Juan-Manuel Perez-Rua, Vladimir Araujo, Juan Carlos
  Niebles, Victor Escorcia, Alvaro Soto
- Abstract summary: We formulate a text-based task conditioner to adapt video features to the few-shot learning task.
Our model obtains state-of-the-art performance on four challenging benchmarks in few-shot video action classification.
- Score: 26.12591949900602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, few-shot learning has received increasing interest. Existing
efforts have been focused on image classification, with very few attempts
dedicated to the more challenging few-shot video classification problem. These
few attempts aim to effectively exploit the temporal dimension in videos for
better learning in low data regimes. However, they have largely ignored a key
characteristic of video which could be vital for few-shot recognition, that is,
videos are often accompanied by rich text descriptions. In this paper, for the
first time, we propose to leverage these human-provided textual descriptions as
privileged information when training a few-shot video classification model.
Specifically, we formulate a text-based task conditioner to adapt video
features to the few-shot learning task. Our model follows a transductive
setting where query samples and support textual descriptions can be used to
update the support set class prototype to further improve the task-adaptation
ability of the model. Our model obtains state-of-the-art performance on four
challenging benchmarks in few-shot video action classification.
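
The transductive idea in the abstract, refining class prototypes with unlabeled query samples, can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic soft-assignment prototype update in NumPy, and it leaves out the text-based conditioner entirely. The function names, the squared Euclidean distance, and the `temperature` parameter are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def squared_distances(feats, prototypes):
    # Pairwise squared Euclidean distances, shape (n_samples, n_classes).
    diff = feats[:, None, :] - prototypes[None, :, :]
    return (diff ** 2).sum(axis=-1)


def class_prototypes(support_feats, support_labels, n_classes):
    # Inductive prototype: mean support feature per class.
    return np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in range(n_classes)]
    )


def transductive_refine(prototypes, query_feats, support_counts, temperature=1.0):
    # Hypothetical update rule: soft-assign each unlabeled query to the
    # classes, then recompute every prototype as a weighted mean of the
    # support mean (weighted by its sample count) and the soft-assigned queries.
    weights = softmax(-squared_distances(query_feats, prototypes) / temperature, axis=1)
    numerator = support_counts[:, None] * prototypes + weights.T @ query_feats
    denominator = support_counts[:, None] + weights.sum(axis=0)[:, None]
    return numerator / denominator


def classify(prototypes, query_feats):
    # Nearest-prototype prediction.
    return squared_distances(query_feats, prototypes).argmin(axis=1)
```

In a 2-way 1-shot episode, one would call `class_prototypes` on the support features, pass the result with the query features to `transductive_refine`, and then call `classify` on the refined prototypes. A text-conditioned variant would additionally modulate the features with an embedding of the class descriptions before these steps; how the paper does this is not specified in the abstract.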
 
      
Related papers
        - A Comprehensive Review of Few-shot Action Recognition [64.47305887411275]
 Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data.
It requires accurately classifying human actions in videos using only a few labeled examples per class.
Numerous approaches have driven significant advancements in few-shot action recognition.
 arXiv (2024-07-20T03:53:32Z)
- Videoprompter: an ensemble of foundational models for zero-shot video
  understanding [113.92958148574228]
 Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
 arXiv (2023-10-23T19:45:46Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
 Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
 arXiv (2023-08-16T15:00:50Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
  Zero Shot [67.00455874279383]
 We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
 arXiv (2023-05-16T19:13:11Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance &
  Cross-Modal Saliency [133.75876535332003]
 We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
 arXiv (2022-08-14T04:07:40Z)
- Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
 We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
 arXiv (2022-04-19T13:14:43Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
 This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
 arXiv (2021-12-08T18:58:16Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal
  Sentiment Analysis [1.6181085766811525]
 We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
 arXiv (2021-05-28T08:39:19Z)
- Learning Implicit Temporal Alignment for Few-shot Video Classification [40.57508426481838]
 Few-shot video classification aims to learn new video categories with only a few labeled examples.
It is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting.
We propose a novel matching-based few-shot learning strategy for video sequences in this work.
 arXiv (2021-05-11T07:18:57Z)
- Generalized Few-Shot Video Classification with Video Retrieval and
  Feature Generation [132.82884193921535]
 We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
 arXiv (2020-07-09T13:05:32Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning
  Using Textual Data [1.004766879203303]
 We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively select frames that are not relevant to convey the information without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
 arXiv (2020-03-31T14:07:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     