Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text
Features
- URL: http://arxiv.org/abs/2212.10596v1
- Date: Tue, 20 Dec 2022 19:12:58 GMT
- Title: Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text
Features
- Authors: Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin
Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross
- Abstract summary: We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings.
We show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised models.
- Score: 9.140696309404133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting actions in untrimmed videos should not be limited to a small,
closed set of classes. We present a simple, yet effective strategy for
open-vocabulary temporal action detection utilizing pretrained image-text
co-embeddings. Despite being trained on static images rather than videos, we
show that image-text co-embeddings enable open-vocabulary performance
competitive with fully-supervised models. We show that the performance can be
further improved by ensembling the image-text features with features encoding
local motion, like optical flow based features, or other modalities, like
audio. In addition, we propose a more reasonable open-vocabulary evaluation
setting for the ActivityNet data set, where the category splits are based on
similarity rather than random assignment.
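To make the abstract's recipe concrete, the following is a minimal sketch, not the authors' implementation: per-frame embeddings from an off-the-shelf image-text model are scored against class-prompt text embeddings by cosine similarity, optionally ensembled with motion- or audio-based scores, and contiguous high-scoring frames are grouped into candidate action segments. All function names, ensemble weights, and thresholds below are illustrative assumptions, and random arrays stand in for real model outputs.

```python
import numpy as np


def cosine_scores(frame_emb, text_emb):
    """Cosine similarity between per-frame image embeddings (T, D) and
    class-prompt text embeddings (C, D); returns a (T, C) score matrix."""
    f = frame_emb / (np.linalg.norm(frame_emb, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb, axis=1, keepdims=True) + 1e-8)
    return f @ t.T


def detect_segments(scores, class_names, threshold):
    """Group contiguous frames whose best class score clears the threshold
    into (start_frame, end_frame, class_name, mean_score) proposals."""
    top_class = scores.argmax(axis=1)
    top_score = scores.max(axis=1)
    segments, start = [], None
    for i in range(len(top_score) + 1):
        run_ends = (
            i == len(top_score)
            or top_score[i] < threshold
            or (start is not None and top_class[i] != top_class[start])
        )
        if start is not None and run_ends:
            segments.append((start, i - 1, class_names[top_class[start]],
                             float(top_score[start:i].mean())))
            start = None
        if i < len(top_score) and start is None and top_score[i] >= threshold:
            start = i
    return segments


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 300, 512
    class_names = ["high jump", "playing guitar", "walking the dog"]
    # In practice, frame_emb comes from the image tower of a pretrained
    # image-text model applied to sampled frames, and text_emb from its text
    # tower applied to prompts such as "a video of a person doing high jump".
    # Random arrays stand in for those outputs here.
    frame_emb = rng.normal(size=(T, D))
    text_emb = rng.normal(size=(len(class_names), D))

    image_text_scores = cosine_scores(frame_emb, text_emb)

    # Optional ensembling with scores from another modality (e.g. an
    # optical-flow or audio classifier); the 0.8/0.2 weights are a guess.
    motion_scores = rng.normal(size=image_text_scores.shape) * 0.01
    scores = 0.8 * image_text_scores + 0.2 * motion_scores

    for start, end, name, score in detect_segments(scores, class_names, threshold=0.02):
        print(f"frames {start}-{end}: {name} (score {score:.3f})")
```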
Related papers
- IFCap: Image-like Retrieval and Frequency-based Entity Filtering for
Zero-shot Captioning [3.8265756182141346]
We propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap.
Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features.
arXiv Detail & Related papers (2024-09-26T16:47:32Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary
Instructions [66.82547612097194]
We propose a novel algorithm, named Open-Edit, which is the first attempt at open-domain image manipulation with open-vocabulary instructions.
Our approach takes advantage of the unified visual-semantic embedding space pretrained on a general image-caption dataset.
We show promising results in manipulating open-vocabulary color, texture, and high-level attributes for various scenarios of open-domain images.
arXiv Detail & Related papers (2020-08-04T14:15:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.