ActBERT: Learning Global-Local Video-Text Representations
- URL: http://arxiv.org/abs/2011.07231v1
- Date: Sat, 14 Nov 2020 07:14:08 GMT
- Title: ActBERT: Learning Global-Local Video-Text Representations
- Authors: Linchao Zhu, Yi Yang
- Abstract summary: We introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data.
We leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects.
ActBERT significantly outperforms state-of-the-art methods, demonstrating its superiority in video-text representation learning.
- Score: 74.29748531654474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce ActBERT for self-supervised learning of joint
video-text representations from unlabeled data. First, we leverage global
action information to catalyze the mutual interactions between linguistic texts
and local regional objects. This uncovers global and local visual clues from
paired video sequences and text descriptions for detailed visual-text
relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to
encode three sources of information, i.e., global actions, local regional
objects, and linguistic descriptions. Global-local correspondences are
discovered via judicious extraction of clues from contextual information. This
enforces the joint video-text representation to be aware of fine-grained objects
as well as global human intention. We validate the generalization capability of
ActBERT on downstream video-and-language tasks, i.e., text-video clip
retrieval, video captioning, video question answering, action segmentation, and
action step localization. ActBERT significantly outperforms state-of-the-art
methods, demonstrating its superiority in video-text representation learning.
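To make the entangled design concrete, the following is a minimal sketch of how a block that mixes the three streams (word embeddings, local region features, and a global action feature) could be wired up, with the action feature acting as the catalyst in both cross-attentions. The class name EntangledBlock, the dimensions, and the exact attention wiring are illustrative assumptions for this sketch, not ActBERT's released implementation.

```python
# Hedged sketch of an "entangled" cross-modal block: a single global action
# feature is prepended to each key-value stream so that word tokens attend
# over (action + regions) and region tokens attend over (action + words).
# All names and sizes here are illustrative, not the authors' code.
import torch
import torch.nn as nn


class EntangledBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.region_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_norm = nn.LayerNorm(dim)
        self.region_norm = nn.LayerNorm(dim)

    def forward(self, words, regions, action):
        # words:   (B, T_w, dim) linguistic token embeddings
        # regions: (B, T_r, dim) local regional object features
        # action:  (B, 1,   dim) global action feature (the "catalyst")
        visual_ctx = torch.cat([action, regions], dim=1)   # action + regions
        lingual_ctx = torch.cat([action, words], dim=1)    # action + words

        # Word queries read from the action-conditioned visual stream.
        w_out, _ = self.text_attn(words, visual_ctx, visual_ctx)
        # Region queries read from the action-conditioned linguistic stream.
        r_out, _ = self.region_attn(regions, lingual_ctx, lingual_ctx)

        # Residual connections + layer norm keep the block stackable.
        return self.text_norm(words + w_out), self.region_norm(regions + r_out)


if __name__ == "__main__":
    block = EntangledBlock()
    words = torch.randn(2, 20, 768)    # caption token embeddings
    regions = torch.randn(2, 16, 768)  # detected object/region features
    action = torch.randn(2, 1, 768)    # clip-level action feature
    new_words, new_regions = block(words, regions, action)
    print(new_words.shape, new_regions.shape)
```

Prepending the action token to each key-value stream is simply the most compact way to let one global cue guide both cross-modal attentions in this sketch; residual connections and layer normalization keep the block stackable.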
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
Current plain and simple text descriptions and the visual-only focus of language-video tasks result in limited capability on real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z)
- Video Summarization: Towards Entity-Aware Captions [75.71891605682931]
We propose the task of summarizing news videos directly into entity-aware captions.
We show that our approach generalizes to existing news image captioning datasets.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
We propose GL-RG, a Global-Local Representation Granularity framework for video captioning.
Our GL-RG demonstrates three advantages over prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce a rich semantic vocabulary that yields a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur optimal captioning behavior.
arXiv Detail & Related papers (2022-05-22T02:00:09Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information change across the video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z)
- HANet: Hierarchical Alignment Networks for Video-Text Retrieval [15.91922397215452]
Video-text retrieval is an important yet challenging task in vision-language understanding.
Most current works simply measure the video-text similarity based on video-level and text-level embeddings.
We propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching.
arXiv Detail & Related papers (2021-07-26T09:28:50Z)