Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
- URL: http://arxiv.org/abs/2311.15075v1
- Date: Sat, 25 Nov 2023 17:01:38 GMT
- Title: Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
- Authors: Ruyang Liu and Jingjia Huang and Wei Gao and Thomas H. Li and Ge Li
- Abstract summary: We propose the Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN) to extend image-text models to diverse video tasks and video-text data.
Mug-STAN significantly improves adaptation of language-image pretrained models such as CLIP and CoCa at both video-text post-pretraining and finetuning stages.
- Score: 47.97650346560239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale image-language pretrained models, e.g., CLIP, have demonstrated
remarkable proficiency in acquiring general multi-modal knowledge through
web-scale image-text data. Despite the impressive performance of image-language
models on various image tasks, how to effectively extend them to general video
understanding remains an area of ongoing exploration. In this paper, we
investigate image-to-video transfer from the perspective of both the model
and the data, unveiling two key obstacles impeding the adaptation of
image-language models: non-generalizable temporal modeling and partially
misaligned video-text data. To address these challenges, we propose
Spatial-Temporal Auxiliary Network with Mutual-guided alignment module
(Mug-STAN), a simple yet effective framework extending image-text models to
diverse video tasks and video-text data. Specifically, STAN adopts a branch
structure with decomposed spatial-temporal modules to enable generalizable
temporal modeling, while Mug suppresses misalignment by introducing token-wise
feature aggregation of either modality from the other. Extensive experimental
results verify that Mug-STAN significantly improves the adaptation of language-image
pretrained models such as CLIP and CoCa at both video-text post-pretraining and
finetuning stages. With our solution, state-of-the-art zero-shot and finetuning
results on various downstream datasets, including MSR-VTT, DiDeMo, LSMDC,
Kinetics-400, Something-Something-2, HMDB-51, UCF-101, and AVA, are achieved.
Moreover, by integrating pretrained Mug-STAN with emerging multimodal
dialogue models, we can realize zero-shot video chatting. Code is available at
https://github.com/farewellthree/STAN
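The abstract names two concrete mechanisms: an auxiliary branch with decomposed spatial-temporal modules (STAN) and token-wise mutual-guided feature aggregation (Mug). The PyTorch snippet below is a minimal sketch of one way to read these two pieces, assuming intermediate (batch, frames, patches, dim) features from a frozen CLIP-style image backbone; the shapes, single-layer design, and scoring rule are illustrative assumptions, not the authors' implementation (see the linked repository for the official code).

```python
# Minimal, illustrative sketch of the two ideas named in the abstract.
# NOT the authors' implementation (see https://github.com/farewellthree/STAN);
# tensor shapes, layer counts, and the scoring rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class STANBlock(nn.Module):
    """Auxiliary-branch layer with decomposed spatial and temporal attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) intermediate features from a frozen
        # image backbone such as CLIP's ViT, fed to the auxiliary branch.
        b, t, p, d = x.shape
        s = x.reshape(b * t, p, d)          # spatial attention within each frame
        h = self.norm1(s)
        s = (s + self.spatial_attn(h, h, h)[0]).reshape(b, t, p, d)
        u = s.permute(0, 2, 1, 3).reshape(b * p, t, d)  # temporal attention across frames
        h = self.norm2(u)
        u = u + self.temporal_attn(h, h, h)[0]
        return u.reshape(b, p, t, d).permute(0, 2, 1, 3)


def mug_score(video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """Token-wise mutual-guided aggregation: each modality pools the other's
    tokens by similarity, which down-weights partially misaligned content."""
    v = F.normalize(video_tokens, dim=-1)     # (batch, Nv, dim)
    w = F.normalize(text_tokens, dim=-1)      # (batch, Nt, dim)
    sim = torch.einsum("bvd,btd->bvt", v, w)  # token-level similarity matrix
    text_guided_video = torch.einsum("bvt,bvd->btd", sim.softmax(dim=1), v)
    video_guided_text = torch.einsum("bvt,btd->bvd", sim.softmax(dim=2), w)
    # Score a video-text pair with the mutually guided representations.
    return ((text_guided_video * w).sum(-1).mean(-1)
            + (video_guided_text * v).sum(-1).mean(-1)) / 2
```

In this reading, the frozen image backbone is left untouched while the auxiliary branch supplies temporal context, and the mutual-guided score replaces a single global video-text dot product, down-weighting tokens with no counterpart in the other modality.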
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of the parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement.
It can flexibly select different modules for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Grafting Pre-trained Models for Multimodal Headline Generation [12.063053852096514]
Multimodal headline generation utilizes both video frames and transcripts to generate natural language titles for videos.
Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks.
We propose a novel approach to graft the video encoder from the pre-trained video-language model on the generative pre-trained language model.
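One common way to realize such grafting is to project the video encoder's output into the language model's embedding space and prepend it as a prefix to the transcript embeddings; the sketch below follows that generic pattern purely as an assumption, not as the paper's actual design (it also assumes a Hugging Face-style LM that accepts inputs_embeds).

```python
# Hedged sketch of grafting a pretrained video encoder onto a generative LM;
# the linear bridge and prefix-token scheme are assumptions for illustration.
import torch
import torch.nn as nn


class GraftedHeadliner(nn.Module):
    def __init__(self, video_encoder: nn.Module, language_model: nn.Module,
                 video_dim: int, lm_dim: int):
        super().__init__()
        self.video_encoder = video_encoder      # taken from a video-language model
        self.language_model = language_model    # generative pretrained LM
        self.bridge = nn.Linear(video_dim, lm_dim)

    def forward(self, frames: torch.Tensor, transcript_embeds: torch.Tensor):
        # Assumes the encoder returns (batch, Nv, video_dim) features; project
        # them into the LM embedding space and prepend to the transcript
        # embeddings so the LM generates the headline conditioned on both.
        video_prefix = self.bridge(self.video_encoder(frames))
        inputs = torch.cat([video_prefix, transcript_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```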
arXiv Detail & Related papers (2022-11-14T08:59:59Z)
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts the pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text.
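A rough sketch of one way such text-conditioned, non-parametric pooling can look, assuming per-token video features and a single pooled text feature; the exact formulation in LiteVL may differ:

```python
# Hedged sketch of text-conditioned, non-parametric pooling; the similarity
# weighting below is an assumption, not LiteVL's exact formulation.
import torch
import torch.nn.functional as F


def text_conditioned_pool(video_tokens: torch.Tensor,
                          text_embed: torch.Tensor) -> torch.Tensor:
    # video_tokens: (batch, N, dim) fine-grained video features
    # text_embed:   (batch, dim) pooled text feature
    weights = torch.einsum(
        "bnd,bd->bn",
        F.normalize(video_tokens, dim=-1),
        F.normalize(text_embed, dim=-1),
    ).softmax(dim=-1)                  # similarity-based weights, no new parameters
    return torch.einsum("bn,bnd->bd", weights, video_tokens)  # reweighted video embedding
```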
arXiv Detail & Related papers (2022-10-21T13:03:49Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)