Multi-modal Transformer for Video Retrieval
- URL: http://arxiv.org/abs/2007.10639v1
- Date: Tue, 21 Jul 2020 07:38:46 GMT
- Title: Multi-modal Transformer for Video Retrieval
- Authors: Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
- Abstract summary: We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
- Score: 67.86763073161012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of retrieving video content relevant to natural language queries
plays a critical role in effectively handling internet-scale datasets. Most of
the existing methods for this caption-to-video retrieval problem do not fully
exploit cross-modal cues present in video. Furthermore, they aggregate
per-frame visual features with limited or no temporal information. In this
paper, we present a multi-modal transformer to jointly encode the different
modalities in video, which allows each of them to attend to the others. The
transformer architecture is also leveraged to encode and model the temporal
information. On the natural language side, we investigate the best practices to
jointly optimize the language embedding together with the multi-modal
transformer. This novel framework allows us to establish state-of-the-art
results for video retrieval on three datasets. More details are available at
http://thoth.inrialpes.fr/research/MMT.
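To make the idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of this kind of architecture: per-modality video features are projected into a shared space, given learned modality and temporal embeddings, and passed through a transformer encoder so that each modality can attend to the others and across time; a pooled video embedding is then matched against a caption embedding by cosine similarity. All module names, dimensions, and the pooling/fusion choices are illustrative assumptions, not the authors' implementation (see the project page above for the actual code).

```python
# Hypothetical sketch of a multi-modal transformer for caption-to-video
# retrieval, in the spirit of the abstract above. Names, dimensions, and the
# pooling/similarity choices are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalVideoEncoder(nn.Module):
    """Jointly encodes per-modality video features so each modality can
    attend to the others (and across time) via transformer self-attention."""

    def __init__(self, num_modalities=3, feat_dim=512, d_model=512,
                 nhead=8, num_layers=4, max_len=64):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Linear(feat_dim, d_model) for _ in range(num_modalities)]
        )
        # Learned modality and temporal embeddings added to every token.
        self.modality_emb = nn.Embedding(num_modalities, d_model)
        self.temporal_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, modality_feats):
        # modality_feats: list of (batch, time_m, feat_dim) tensors, one per
        # modality (e.g. appearance, motion, audio); time_m must be <= max_len.
        tokens = []
        for m, feats in enumerate(modality_feats):
            b, t, _ = feats.shape
            x = self.proj[m](feats)
            x = x + self.modality_emb.weight[m]
            x = x + self.temporal_emb(torch.arange(t, device=feats.device))
            tokens.append(x)
        # Concatenate all modality tokens; self-attention lets every token
        # attend to every other one, i.e. cross-modal and temporal interaction.
        joint = self.encoder(torch.cat(tokens, dim=1))
        return F.normalize(joint.mean(dim=1), dim=-1)  # pooled video embedding


def caption_to_video_scores(caption_emb, video_embs):
    """Rank videos by cosine similarity between caption and video embeddings."""
    return F.normalize(caption_emb, dim=-1) @ video_embs.t()
```

In this sketch the caption embedding would come from a language encoder that is optimized jointly with the video-side transformer, in line with the abstract's point about jointly training the language embedding.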
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the need to model video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Joint Moment Retrieval and Highlight Detection Via Natural Language
Queries [0.0]
We propose a new method for natural language query based joint video summarization and highlight detection.
This approach will use both visual and audio cues to match a user's natural language query to retrieve the most relevant and interesting moments from a video.
Our approach employs multiple recent techniques used in Vision Transformers (ViTs) to create a transformer-like encoder-decoder model.
arXiv Detail & Related papers (2023-05-08T18:00:33Z)
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) framework exploits visual, text, and audio features to score a video with respect to its frames.
MFST first extracts features for each modality (visual, text, audio) using pretrained encoders.
It then trains a multimodal frame-scoring transformer that takes these video-text-audio representations as input and predicts frame-level scores (a minimal sketch of this kind of pipeline appears after this list).
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval [36.50847375135979]
Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation.
We present a multi-modal fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation.
arXiv Detail & Related papers (2021-12-08T18:14:57Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical multimodal transformer (HMT) is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Extensive experiments show that HMT surpasses most traditional, RNN-based, and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
- Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning [92.17835753226333]
We propose a novel unsupervised video-to-video translation model.
Our model decomposes the style and the content using the specialized UV-decoder structure.
Our model can produce photo-realistic videos in a multimodal way.
arXiv Detail & Related papers (2020-04-14T13:44:30Z)
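As a companion to the MFST summary in the list above, here is a minimal, hypothetical sketch of a frame-scoring pipeline of that general kind: features from pretrained per-modality encoders are projected into a shared space, fused, and fed to a transformer that outputs one relevance score per frame. The names, dimensions, and the additive fusion are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a multimodal frame-scoring transformer in the spirit
# of the MFST summary above: per-modality features from pretrained encoders
# are fused and a transformer predicts a relevance score per frame.
# All names and dimensions are illustrative assumptions.
import torch.nn as nn

class FrameScoringTransformer(nn.Module):
    def __init__(self, visual_dim=768, text_dim=768, audio_dim=128,
                 d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # Project each modality's pretrained features into a shared space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.score_head = nn.Linear(d_model, 1)  # one score per frame

    def forward(self, visual, audio, text):
        # visual/audio: (batch, num_frames, dim); text: (batch, text_dim).
        # Fuse by summing frame-aligned features with a broadcast text vector.
        fused = (self.visual_proj(visual)
                 + self.audio_proj(audio)
                 + self.text_proj(text).unsqueeze(1))
        encoded = self.encoder(fused)                 # frames attend to each other
        return self.score_head(encoded).squeeze(-1)   # (batch, num_frames) scores
```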