Video-Text Pre-training with Learned Regions
- URL: http://arxiv.org/abs/2112.01194v2
- Date: Mon, 6 Dec 2021 07:25:35 GMT
- Title: Video-Text Pre-training with Learned Regions
- Authors: Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin,
Guanyu Cai, Jinhui Tang
- Abstract summary: Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text representation learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
- Score: 59.30893505895156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-Text pre-training aims at learning transferable representations from
large-scale video-text pairs via aligning the semantics between visual and
textual information. State-of-the-art approaches extract visual features from
raw pixels in an end-to-end fashion. However, these methods operate directly at
the frame level and thus overlook the spatio-temporal structure of objects in
video, which nevertheless has a strong synergy with nouns in textual descriptions. In
this work, we propose a simple yet effective module for video-text
representation learning, namely RegionLearner, which can take into account the
structure of objects during pre-training on large-scale video-text pairs. Given
a video, our module (1) first quantizes visual features into semantic clusters,
then (2) generates learnable masks and uses them to aggregate the features
belonging to the same semantic region, and finally (3) models the interactions
between different aggregated regions. In contrast to using off-the-shelf object
detectors, our proposed module does not require explicit supervision and is
much more computationally efficient. We pre-train the proposed approach on the
public WebVid2M and CC3M datasets. Extensive evaluations on four downstream
video-text retrieval benchmarks clearly demonstrate the effectiveness of our
RegionLearner. The code will be available at
https://github.com/ruiyan1995/Region_Learner.
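To make steps (1)-(3) of the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a RegionLearner-style module. It is not the released implementation: the codebook size, number of regions, feature dimension, and the straight-through quantization trick are assumptions chosen only to illustrate the quantize-aggregate-interact pipeline.

```python
import torch
import torch.nn as nn


class RegionLearnerSketch(nn.Module):
    """Illustrative sketch: quantize -> aggregate into regions -> model interactions."""

    def __init__(self, dim=768, num_codes=1024, num_regions=8):
        super().__init__()
        # (1) codebook used to quantize patch features into semantic clusters (assumed size)
        self.codebook = nn.Embedding(num_codes, dim)
        # (2) learnable soft masks assigning spatio-temporal positions to regions
        self.mask_proj = nn.Linear(dim, num_regions)
        # (3) self-attention to model interactions between aggregated regions
        self.interact = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                   batch_first=True)

    def forward(self, feats):
        # feats: (B, N, D) visual features flattened over frames and patches
        # (1) nearest-codebook assignment via squared Euclidean distance
        dist = (feats.pow(2).sum(-1, keepdim=True)
                - 2 * feats @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        quantized = self.codebook(dist.argmin(dim=-1))             # (B, N, D)
        # straight-through estimator so gradients still reach the visual encoder
        quantized = feats + (quantized - feats).detach()
        # (2) each region token is a mask-weighted average over positions
        masks = self.mask_proj(quantized).softmax(dim=1)           # (B, N, R)
        regions = torch.einsum('bnr,bnd->brd', masks, quantized)   # (B, R, D)
        # (3) let the aggregated regions interact
        return self.interact(regions)                              # (B, R, D)


# Dummy usage: 16 frames x 7x7 patches of 768-d features -> 8 region-level tokens.
video_feats = torch.randn(2, 16 * 49, 768)
region_tokens = RegionLearnerSketch()(video_feats)    # (2, 8, 768)
```

The region tokens produced this way would then be aligned with the text representation during pre-training; that contrastive objective is omitted here for brevity.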
Related papers
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts the start and end boundaries in a video given a text description; a minimal boundary-prediction sketch is given after this list.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z) - MILES: Visual BERT Pre-training with Injected Language Semantics for
Video-text Retrieval [43.2299969152561]
Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols.
arXiv Detail & Related papers (2022-04-26T16:06:31Z) - Reading-strategy Inspired Visual Representation Learning for
Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space in which semantic similarity can be measured.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)