Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text
Retrieval
- URL: http://arxiv.org/abs/2301.12644v1
- Date: Mon, 30 Jan 2023 03:53:19 GMT
- Title: Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text
Retrieval
- Authors: Yizhen Chen, Jie Wang, Lijian Lin, Zhongang Qi, Jin Ma, Ying Shan
- Abstract summary: Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent years.
In this paper, we integrate multi-modal information in an explicit manner by tagging, and use the tags as the anchors for better video-text alignment.
To strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks.
- Score: 23.418120617544545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language alignment learning for video-text retrieval has attracted a lot
of attention in recent years. Most of the existing methods either transfer the
knowledge of an image-text pretraining model to the video-text retrieval task without
fully exploring the multi-modal information of videos, or simply fuse
multi-modal features in a brute force manner without explicit guidance. In this
paper, we integrate multi-modal information in an explicit manner by tagging,
and use the tags as the anchors for better video-text alignment. Various
pretrained experts are utilized for extracting the information of multiple
modalities, including object, person, motion, audio, etc. To take full
advantage of this information, we propose the TABLE (TAgging Before aLignmEnt)
network, which consists of a visual encoder, a tag encoder, a text encoder, and
a tag-guiding cross-modal encoder for jointly encoding multi-frame visual
features and multi-modal tag information. Furthermore, to strengthen the
interaction between video and text, we build a joint cross-modal encoder with
the triplet input of [vision, tag, text] and perform two additional supervised
tasks, Video Text Matching (VTM) and Masked Language Modeling (MLM). Extensive
experimental results demonstrate that the TABLE model is capable of achieving
State-Of-The-Art (SOTA) performance on various video-text retrieval benchmarks,
including MSR-VTT, MSVD, LSMDC and DiDeMo.
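The abstract only outlines the architecture at a high level, so the following PyTorch sketch is an illustrative reconstruction of a TABLE-style model rather than the authors' implementation: the `TableSketch` class, its layer counts, hidden sizes, the plain Transformer blocks, and the cross-attention used for the tag-guiding step are all assumptions made for the example.

```python
# Illustrative sketch of a TABLE-style model (not the authors' code).
# Assumptions: plain Transformer blocks for all encoders, cross-attention from
# frame features to tag embeddings as the "tag-guiding" step, and simple linear
# heads for Video Text Matching (VTM) and Masked Language Modeling (MLM).
import torch
import torch.nn as nn


def transformer(dim: int, depth: int, heads: int = 8) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)


class TableSketch(nn.Module):
    def __init__(self, dim: int = 512, vocab_size: int = 30522, frame_dim: int = 768):
        super().__init__()
        # Unimodal encoders (plain Transformers here for brevity).
        self.frame_proj = nn.Linear(frame_dim, dim)     # per-frame features -> model dim
        self.tag_embed = nn.Embedding(vocab_size, dim)  # tags from pretrained experts, tokenized
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.visual_enc = transformer(dim, depth=2)
        self.tag_enc = transformer(dim, depth=2)
        self.text_enc = transformer(dim, depth=2)
        # Tag-guiding cross-modal step: frames attend to multi-modal tags (the anchors).
        self.tag_guide = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Joint cross-modal encoder over the concatenated [vision, tag, text] triplet.
        self.joint_enc = transformer(dim, depth=2)
        self.vtm_head = nn.Linear(dim, 2)           # matched / not matched
        self.mlm_head = nn.Linear(dim, vocab_size)  # predict masked text tokens

    def forward(self, frame_feats, tag_ids, text_ids):
        # frame_feats: (B, F, frame_dim); tag_ids, text_ids: (B, L) token ids.
        v = self.visual_enc(self.frame_proj(frame_feats))
        g = self.tag_enc(self.tag_embed(tag_ids))
        t = self.text_enc(self.text_embed(text_ids))
        v_guided, _ = self.tag_guide(query=v, key=g, value=g)  # tag-guided frame features
        joint = self.joint_enc(torch.cat([v_guided, g, t], dim=1))
        vtm_logits = self.vtm_head(joint[:, 0])                 # pooled first position
        mlm_logits = self.mlm_head(joint[:, -t.size(1):])       # positions of the text tokens
        # Mean-pooled video/text embeddings for the contrastive alignment objective.
        return v_guided.mean(dim=1), t.mean(dim=1), vtm_logits, mlm_logits


# Smoke test with random inputs: 2 videos, 12 frames, 16 tag tokens, 32 text tokens.
model = TableSketch()
v_emb, t_emb, vtm, mlm = model(
    torch.randn(2, 12, 768),
    torch.randint(0, 30522, (2, 16)),
    torch.randint(0, 30522, (2, 32)),
)
print(v_emb.shape, t_emb.shape, vtm.shape, mlm.shape)
```

In this sketch, the pooled video and text embeddings would feed a contrastive alignment loss, while the VTM and MLM heads supply the two additional supervised tasks applied on top of the joint encoder.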
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)