Contrastive Graph Multimodal Model for Text Classification in Videos
- URL: http://arxiv.org/abs/2206.02343v1
- Date: Mon, 6 Jun 2022 04:06:21 GMT
- Title: Contrastive Graph Multimodal Model for Text Classification in Videos
- Authors: Ye Liu and Changchong Lu and Chen Lin and Di Yin and Bo Ren
- Abstract summary: We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
- Score: 9.218562155255233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The extraction of text information in videos serves as a critical step
towards semantic understanding of videos. It usually involves two steps: (1)
text recognition and (2) text classification. To localize text in videos, we
can draw on a large number of text recognition methods based on OCR technology.
However, to our knowledge, no existing work has focused on the second step of
video text classification, which limits the guidance available to downstream
tasks such as video indexing and browsing. In this paper, we are the
first to address this new task of video text classification by fusing
multimodal information to deal with the challenging scenario where different
types of video text are easily confused owing to varied colors, unknown fonts,
and complex layouts. In addition, we tailor a specific module called CorrelationNet
to reinforce feature representation by explicitly extracting layout
information. Furthermore, contrastive learning is utilized to explore inherent
connections between samples using plentiful unlabeled videos. Finally, we
construct a new well-defined industrial dataset from the news domain, called
TI-News, which is dedicated to building and evaluating video text recognition
and classification applications. Extensive experiments on TI-News demonstrate
the effectiveness of our method.
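The abstract names three ingredients: fused textual/visual features, an explicit layout signal (CorrelationNet), and a contrastive objective over unlabeled videos. The paper's code is not shown here, so the following is only a minimal PyTorch sketch of how those pieces could fit together; the module names, dimensions, and fusion-by-addition scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutAwareTextClassifier(nn.Module):
    """Hypothetical sketch: classify an OCR'd text line in a video frame
    by fusing textual, visual, and layout (bounding-box) features."""
    def __init__(self, text_dim=768, vis_dim=512, num_classes=5, emb_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, emb_dim)
        self.vis_proj = nn.Linear(vis_dim, emb_dim)
        # Layout branch: normalized box coords (x1, y1, x2, y2) -> embedding.
        # Loosely inspired by the paper's idea of extracting layout
        # information explicitly; the real CorrelationNet is not shown here.
        self.layout_mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, emb_dim))
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, text_feat, vis_feat, box):
        fused = (self.text_proj(text_feat) + self.vis_proj(vis_feat)
                 + self.layout_mlp(box))
        z = F.normalize(fused, dim=-1)   # embedding for the contrastive loss
        return self.classifier(fused), z

def info_nce(z1, z2, tau=0.07):
    """Standard InfoNCE between two augmented views of the same text lines,
    usable on unlabeled videos (the contrastive part of the abstract)."""
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```

The classifier head would be trained on labeled samples, while `info_nce` could be applied to augmented views of text lines from unlabeled videos, mirroring the abstract's use of plentiful unlabeled data.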
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
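A minimal sketch of that pseudo-labeling protocol: caption sampled frames with an off-the-shelf image captioner, then treat each (caption, video) pair as a training example for standard retrieval. The `sample_frames` method and the `captioner` callable below are placeholders, not the paper's actual interface.

```python
def build_pseudo_pairs(videos, captioner, frames_per_video=3):
    """Hypothetical helper: label unannotated videos with frame captions
    so they can serve as (text, video) pairs for retrieval training."""
    pairs = []
    for video in videos:
        frames = video.sample_frames(frames_per_video)  # assumed API
        for frame in frames:
            caption = captioner(frame)      # any off-the-shelf captioner
            pairs.append((caption, video))  # pseudo text-video pair
    return pairs
```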
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
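The summary says the syntax hierarchy guides the similarity calculation; one generic way to realize that is to score the video against sentence-, phrase-, and word-level text embeddings and combine the levels. Everything below (level weights, max-pooling) is an illustrative assumption, not SHE-Net's actual formulation.

```python
import torch.nn.functional as F

def syntax_guided_similarity(video_emb, sent_emb, phrase_embs, word_embs,
                             level_weights=(0.5, 0.3, 0.2)):
    """Hypothetical hierarchy-guided score over three syntax levels.

    video_emb:   [dim]         pooled video representation
    sent_emb:    [dim]         whole-sentence embedding
    phrase_embs: [num_p, dim]  phrase-level embeddings
    word_embs:   [num_w, dim]  word-level embeddings
    """
    v = F.normalize(video_emb, dim=-1)
    s_sent = F.normalize(sent_emb, dim=-1) @ v
    s_phrase = (F.normalize(phrase_embs, dim=-1) @ v).max()
    s_word = (F.normalize(word_embs, dim=-1) @ v).max()
    w = level_weights
    return w[0] * s_sent + w[1] * s_phrase + w[2] * s_word
```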
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
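The entry contrasts latent (dense) and lexicon (sparse, vocabulary-sized) representations. A generic way to combine the two similarity scores might look like the following; the SPLADE-style `log1p(relu(...))` sparsification and the interpolation weight are assumptions, not UNIFY's published design.

```python
import torch
import torch.nn.functional as F

def unified_similarity(video_latent, text_latent,
                       video_lex_logits, text_lex_logits, alpha=0.5):
    """Hypothetical fusion of dense (latent) and sparse (lexicon) scores.

    video_lex_logits / text_lex_logits: [batch, vocab_size] activations
    whose positive entries weight vocabulary terms.
    """
    s_latent = (F.normalize(video_latent, dim=-1)
                @ F.normalize(text_latent, dim=-1).t())
    # log(1 + ReLU(x)) keeps lexicon weights sparse and non-negative.
    v_lex = torch.log1p(F.relu(video_lex_logits))
    t_lex = torch.log1p(F.relu(text_lex_logits))
    s_lex = v_lex @ t_lex.t()
    return alpha * s_latent + (1 - alpha) * s_lex
```

The two terms are complementary in general: the latent score captures holistic semantics, while the lexicon score rewards exact term matches.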
arXiv Detail & Related papers (2024-02-26T17:36:50Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, which uses only text queries together with uncurated web videos during training.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval [23.418120617544545]
Vision-language alignment learning for video-text retrieval has attracted considerable attention in recent years.
In this paper, we integrate multi-modal information in an explicit manner by tagging, and use the tags as the anchors for better video-text alignment.
To strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks.
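The triplet input lends itself to a single joint encoder over concatenated token segments, with tags acting as anchors between the modalities. A sketch of that idea is below; the token dimensions, type embeddings, and encoder configuration are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TripletCrossModalEncoder(nn.Module):
    """Hypothetical joint encoder over a [vision, tag, text] sequence."""
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Learned type embeddings distinguish the three segments.
        self.type_emb = nn.Embedding(3, dim)

    def forward(self, vision_tokens, tag_tokens, text_tokens):
        segs = [vision_tokens, tag_tokens, text_tokens]
        x = torch.cat([s + self.type_emb.weight[i]
                       for i, s in enumerate(segs)], dim=1)
        return self.encoder(x)
```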
arXiv Detail & Related papers (2023-01-30T03:53:19Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets.
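A cross-attention readout is one plausible form for such a bridge module: question tokens derived from the text (for example, masked content phrases; the exact construction is an assumption here) attend to video tokens to produce "answer" features. This is a sketch of the idea, not BridgeFormer's published architecture.

```python
import torch.nn as nn

class BridgeModule(nn.Module):
    """Hypothetical bridge: question tokens (from masked text phrases)
    attend to video tokens to produce 'answer' features."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, question_tokens, video_tokens):
        # Answers are recovered from video content, so associations
        # between local video and text features can be established.
        answers, _ = self.cross_attn(query=question_tokens,
                                     key=video_tokens, value=video_tokens)
        return answers
```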
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet.
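Token-aware weighting is one of the two techniques the title names: grounding contrastive alignment on content words (for example, nouns and verbs) rather than weighting all tokens equally. A simplified token-weighted similarity might look like the sketch below; the POS-based weights and the pooling are assumptions, and the cascade half of TACo is not covered here.

```python
import torch.nn.functional as F

def token_aware_similarity(text_tokens, video_clip_emb, token_weights):
    """Hypothetical token-aware score: per-token text/video similarities
    pooled with weights that emphasize content words.

    text_tokens:    [num_tokens, dim]
    video_clip_emb: [dim]
    token_weights:  [num_tokens], e.g., 1.0 for nouns/verbs, 0.2 otherwise
    """
    sims = (F.normalize(text_tokens, dim=-1)
            @ F.normalize(video_clip_emb, dim=-1))
    w = token_weights / token_weights.sum()
    return (w * sims).sum()
```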
arXiv Detail & Related papers (2021-08-23T07:24:57Z)