MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action
Recognition with Language Knowledge
- URL: http://arxiv.org/abs/2303.08914v2
- Date: Sat, 22 Jul 2023 09:49:15 GMT
- Title: MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action
Recognition with Language Knowledge
- Authors: Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz
Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof
- Abstract summary: Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities.
We propose an unsupervised approach to tuning a VL model on video data for best zero-shot action recognition performance.
Our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks.
- Score: 35.45809761628721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large scale Vision-Language (VL) models have shown tremendous success in
aligning representations between visual and text modalities. This enables
remarkable progress in zero-shot recognition, image generation & editing, and
many other exciting tasks. However, VL models tend to over-represent objects
while paying much less attention to verbs, and require additional tuning on
video data for best zero-shot action recognition performance. While previous
work relied on large-scale, fully-annotated data, in this work we propose an
unsupervised approach. We adapt a VL model for zero-shot and few-shot action
recognition using a collection of unlabeled videos and an unpaired action
dictionary. Based on that, we leverage Large Language Models and VL models to
build a text bag for each unlabeled video via matching, text expansion and
captioning. We use those bags in a Multiple Instance Learning setup to adapt an
image-text backbone to video data. Although finetuned on unlabeled video data,
our resulting models demonstrate high transferability to numerous unseen
zero-shot downstream tasks, improving the base VL model performance by up to
14%, and even comparing favorably to fully-supervised baselines in both
zero-shot and few-shot video recognition transfer. The code will be released
later at https://github.com/wlin-at/MAXI.
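The abstract sketches the core recipe: each unlabeled video receives a text bag (action names matched from the unpaired dictionary, LLM expansions of those names, and frame captions), and the image-text backbone is adapted with a Multiple Instance Learning objective over those bags. The snippet below is a minimal sketch of one common way to instantiate such an objective, a MIL-NCE-style contrastive loss over per-video text bags; the function name, temperature, and exact loss form are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mil_nce_over_text_bags(video_emb, text_bag_emb, temperature=0.07):
    """MIL-style contrastive loss over per-video text bags (illustrative sketch).

    video_emb:    (B, D)    one embedding per unlabeled video clip
    text_bag_emb: (B, K, D) K candidate texts per clip (matched action names,
                            LLM expansions, frame captions)
    """
    B, K, _ = text_bag_emb.shape
    v = F.normalize(video_emb, dim=-1)                        # (B, D)
    t = F.normalize(text_bag_emb, dim=-1).reshape(B * K, -1)  # (B*K, D)

    sim = (v @ t.t()) / temperature      # (B, B*K) scaled cosine similarities
    sim = sim.view(B, B, K)              # sim[i, j, k] = <v_i, text k of clip j>

    # Positives: every text in clip i's own bag. Negatives: all other bags.
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)  # (B,)
    den = torch.logsumexp(sim.reshape(B, -1), dim=-1)                     # (B,)
    return (den - pos).mean()

# Toy usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    B, K, D = 4, 8, 512
    video_emb = torch.randn(B, D, requires_grad=True)
    text_bag_emb = torch.randn(B, K, D)
    loss = mil_nce_over_text_bags(video_emb, text_bag_emb)
    loss.backward()
    print(float(loss))
```

In this setup, video_emb would come from temporally pooled frame features of a CLIP-style image encoder and text_bag_emb from its text encoder applied to the bag members, so finetuning against such a loss needs only the unlabeled videos and the action dictionary, with no annotations.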
Related papers
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [27.04277811443469]
Video-LLaVA learns from a mixed dataset of images and videos, mutually enhancing each other.
Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits.
arXiv Detail & Related papers (2023-11-16T10:59:44Z) - Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z) - Scalable and Accurate Self-supervised Multimodal Representation Learning
without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z) - OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z) - Language Models with Image Descriptors are Strong Few-Shot
Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z) - A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based
Learning for Vision-Language Models [50.27305012063483]
FewVLM is a few-shot prompt-based learner on vision-language tasks.
We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM).
We observe that prompts significantly affect zero-shot performance but marginally affect few-shot performance.
arXiv Detail & Related papers (2021-10-16T06:07:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.