VicTR: Video-conditioned Text Representations for Activity Recognition
- URL: http://arxiv.org/abs/2304.02560v2
- Date: Fri, 29 Mar 2024 16:56:33 GMT
- Title: VicTR: Video-conditioned Text Representations for Activity Recognition
- Authors: Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo
- Abstract summary: We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
- Score: 73.09929391614266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary: that better video-VLMs can be designed by focusing more on augmenting text, rather than visual, information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g., object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.
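To make the idea concrete, below is a minimal sketch (PyTorch, not the authors' released code) of the general recipe the abstract describes: class-prompt text embeddings attend to per-frame visual embeddings via cross-attention, and the resulting video-conditioned text embeddings are trained with a CLIP-style contrastive objective. All module names, dimensions, and the mean-pooling of frame features are illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoConditionedText(nn.Module):
    """Illustrative module: class text embeddings attend to per-frame embeddings."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb, frame_emb):
        # text_emb:  (num_classes, dim)   e.g., from a frozen text encoder
        # frame_emb: (batch, frames, dim) e.g., from a frozen image encoder
        q = text_emb.unsqueeze(0).expand(frame_emb.size(0), -1, -1)   # (B, C, D)
        attended, _ = self.cross_attn(q, frame_emb, frame_emb)        # text queries video
        return self.norm(q + attended)                                # (B, C, D)

def contrastive_loss(video_emb, cond_text_emb, labels, temperature=0.07):
    v = F.normalize(video_emb, dim=-1).unsqueeze(1)    # (B, 1, D) pooled video
    t = F.normalize(cond_text_emb, dim=-1)             # (B, C, D) conditioned text
    logits = (v * t).sum(-1) / temperature             # (B, C) per-class similarity
    return F.cross_entropy(logits, labels)

# Toy usage: random features stand in for pretrained image/text encoder outputs.
B, T, C, D = 2, 8, 10, 512
frame_emb = torch.randn(B, T, D)
text_emb = torch.randn(C, D)
cond_text = VideoConditionedText(D)(text_emb, frame_emb)
loss = contrastive_loss(frame_emb.mean(dim=1), cond_text, torch.tensor([3, 7]))
```
Because the text side is conditioned on each video, every sample sees its own set of class embeddings, which is what makes the contrastive latent space more flexible than a fixed text-prompt bank.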
Related papers
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, over 10x longer than the captions in most video-text datasets.
The corpus also supports training a model capable of end-to-end generation of dense and detailed captions for long videos.
- 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation [13.622700558266658]
We propose using frozen pre-trained vision-language models (VLM) as backbones, with a specific emphasis on enhancing cross-modal feature interaction.
First, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the domain-gap issue (see the sketch after this summary).
Second, we add additional cross-modal feature fusion to the pipeline to better exploit multi-modal information.
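As a rough sketch of the frozen-backbone feature-extraction step mentioned above (not the authors' pipeline, which uses a convolutional CLIP), the snippet below extracts frozen CLIP features for sampled frames and a referring expression via the Hugging Face transformers API; the ViT checkpoint and the example prompt are placeholder assumptions.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP backbone (a ViT variant here, standing in for a convolutional CLIP).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():
    p.requires_grad_(False)

# Dummy images stand in for frames sampled from a video clip.
frames = [Image.new("RGB", (224, 224)) for _ in range(8)]
inputs = processor(text=["a person riding a bicycle"], images=frames,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    frame_feats = model.get_image_features(pixel_values=inputs["pixel_values"])      # (8, D)
    text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])    # (1, D)

# Trainable cross-modal fusion layers (e.g., cross-attention) would consume
# frame_feats and text_feats; only those added layers are updated during training.
```
Keeping the backbone frozen and training only the added fusion layers preserves the image-text alignment of the pretrained model, which is the usual motivation for this design.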
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", built upon a fixed T2V model with purpose-designed components.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging because video dynamics are difficult to model.
In this paper, we address these limitations in video pre-training with an efficient video decomposition.
Our framework can both comprehend and generate image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of representations learned on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model, improving performance through visual-text grounding.
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
- MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge [35.45809761628721]
Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between the visual and text modalities.
We propose an unsupervised approach to fine-tuning VL models on video data for the best zero-shot action recognition performance.
Our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks.
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
Correlating vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separately pre-trained feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.