Object-aware Video-language Pre-training for Retrieval
- URL: http://arxiv.org/abs/2112.00656v2
- Date: Thu, 2 Dec 2021 07:00:46 GMT
- Title: Object-aware Video-language Pre-training for Retrieval
- Authors: Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying
Shan, Xiaohu Qie, Mike Zheng Shou
- Abstract summary: We present Object-aware Transformers, an object-centric approach that extends video-language transformers to incorporate object representations.
We show clear improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture.
- Score: 24.543719616308945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, by introducing large-scale datasets and strong transformer
networks, video-language pre-training has shown great success, especially for
retrieval. Yet, existing video-language transformer models do not explicitly
perform fine-grained semantic alignment. In this work, we present Object-aware
Transformers, an object-centric approach that extends video-language
transformers to incorporate
object representations. The key idea is to leverage the bounding boxes and
object tags to guide the training process. We evaluate our model on three
standard sub-tasks of video-text matching on four widely used benchmarks. We
also provide deep analysis and detailed ablation about the proposed method. We
show clear improvement in performance across all tasks and datasets considered,
demonstrating the value of a model that incorporates object representations
into a video-language architecture. The code will be released at
\url{https://github.com/FingerRec/OA-Transformer}.
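As a rough, hedged sketch of the key idea above (using detector bounding boxes and object tags to guide video-language training), the PyTorch snippet below appends object tokens, built from RoI features, box geometry, and tag embeddings, to the video patch tokens and aligns pooled video/text embeddings with a standard contrastive loss. All module names, dimensions, and the fusion scheme are illustrative assumptions, not the released OA-Transformer implementation.

```python
# Minimal sketch (not the authors' OA-Transformer): one way to fold detected
# object boxes and tags into a video-language transformer as extra tokens.
# Module names, feature dimensions, and the fusion scheme are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareEncoder(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_layers=4, tag_vocab=30522):
        super().__init__()
        self.box_proj = nn.Linear(4, dim)              # project (x1, y1, x2, y2) box geometry
        self.roi_proj = nn.Linear(2048, dim)           # project per-object RoI appearance features
        self.tag_embed = nn.Embedding(tag_vocab, dim)  # embed detector object tags
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patch_tokens, roi_feats, boxes, tag_ids):
        # patch_tokens: (B, N_patches, dim)  frame/patch tokens from the video backbone
        # roi_feats:    (B, N_obj, 2048)     per-object appearance features
        # boxes:        (B, N_obj, 4)        normalized bounding boxes
        # tag_ids:      (B, N_obj)           object tag ids from the detector
        obj_tokens = self.roi_proj(roi_feats) + self.box_proj(boxes) + self.tag_embed(tag_ids)
        tokens = torch.cat([patch_tokens, obj_tokens], dim=1)  # append object tokens to video tokens
        return self.encoder(tokens)

def video_text_contrastive(video_emb, text_emb, temperature=0.07):
    # Standard InfoNCE-style loss aligning pooled video and text embeddings.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

In the paper, boxes and tags guide the pre-training objectives themselves; in this sketch they only enter as additional input tokens to the encoder.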
Related papers
- Video Referring Expression Comprehension via Transformer with
Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on a queried natural-language expression.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of spatio-temporal representations on egocentric videos.
We show that the model can act as a drop-in replacement for an egocentric video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Patch-based Object-centric Transformers for Efficient Video Generation [71.55412580325743]
We present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture.
We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos.
Due to the better compressibility of object-centric representations, we can improve training efficiency by allowing the model to access only object information when modeling longer-horizon temporal dependencies.
arXiv Detail & Related papers (2022-06-08T16:29:59Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z)
- Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z)