Synchronized Audio-Visual Frames with Fractional Positional Encoding for
Transformers in Video-to-Text Translation
- URL: http://arxiv.org/abs/2112.14088v1
- Date: Tue, 28 Dec 2021 10:57:18 GMT
- Title: Synchronized Audio-Visual Frames with Fractional Positional Encoding for
Transformers in Video-to-Text Translation
- Authors: Philipp Harzig, Moritz Einfalt, Rainer Lienhart
- Abstract summary: Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips.
Transformers have shown great performance in both machine translation and image captioning, but a straightforward and reproducible application to VTT is still lacking.
We explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture.
- Score: 26.36252496316238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, but a straightforward and reproducible application to VTT is still lacking, and there is no comprehensive study of strategies for video description generation, including how to exploit the accompanying audio with fully self-attentive networks. Thus, we explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset to determine a configuration, applicable to unseen datasets, that helps describe short video clips in natural language; it improves the CIDEr and BLEU-4 scores by 37.13 and 12.83 points over a vanilla Transformer network and achieves state-of-the-art results on the MSR-VTT and MSVD datasets. Moreover, FPE by itself increases the CIDEr score by 8.6% relative.
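The abstract does not spell out the FPE formula. As a rough illustration only, the following is a minimal sketch assuming that FPE amounts to evaluating the standard sinusoidal positional encoding at fractional, timestamp-derived positions, so that audio and video tokens share one common time axis. The feature rates (one video feature per frame index, audio features arriving 2.5 times as often) and the helper name sinusoidal_pe are made up for this sketch and are not taken from the paper.

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    """Sinusoidal positional encoding evaluated at (possibly fractional) positions."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]        # (T, 1)
    dims = np.arange(d_model)[None, :]                                   # (1, D)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)   # (1, D)
    angles = positions * angle_rates                                     # (T, D)
    pe = np.empty_like(angles)
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions: cosine
    return pe

# Assumed rates (illustrative only): video tokens sit at integer positions,
# audio tokens at fractional positions on the same time axis, so co-occurring
# audio and video tokens receive matching positional encodings.
video_positions = np.arange(10, dtype=np.float64)        # 0, 1, 2, ...
audio_positions = np.arange(25, dtype=np.float64) / 2.5  # 0.0, 0.4, 0.8, ...
pe_video = sinusoidal_pe(video_positions, d_model=512)   # shape (10, 512)
pe_audio = sinusoidal_pe(audio_positions, d_model=512)   # shape (25, 512)
```

Under this reading, both streams can then be concatenated into a single token sequence whose positional encodings reflect real time rather than per-modality token index, which is what keeps the audio and video frames synchronized inside the Transformer.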
Related papers
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z)
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
- Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training [112.91603911837436]
Auto-captions on GIF dataset is a new large-scale pre-training dataset for generic video understanding.
All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages.
arXiv Detail & Related papers (2020-07-05T16:11:57Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)