Keyframe Segmentation and Positional Encoding for Video-guided Machine
Translation Challenge 2020
- URL: http://arxiv.org/abs/2006.12799v1
- Date: Tue, 23 Jun 2020 07:15:11 GMT
- Title: Keyframe Segmentation and Positional Encoding for Video-guided Machine
Translation Challenge 2020
- Authors: Tosho Hirasawa and Zhishen Yang and Mamoru Komachi and Naoaki Okazaki
- Abstract summary: We present our video-guided machine translation system submitted to the Video-guided Machine Translation Challenge 2020.
In the evaluation phase, our system scored 36.60 corpus-level BLEU-4 and achieved 1st place in the Video-guided Machine Translation Challenge 2020.
- Score: 28.38178018722211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-guided machine translation is a multimodal neural machine
translation task that aims to generate high-quality text translations by
tangibly engaging both video and text. In this work, we present our
video-guided machine translation system submitted to the Video-guided Machine
Translation Challenge 2020. The system employs keyframe-based video feature
extraction together with positional encoding of the video features. In the
evaluation phase, our system scored 36.60 corpus-level BLEU-4 and achieved 1st
place in the Video-guided Machine Translation Challenge 2020.
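As a rough illustration of the two components named in the abstract, the sketch below pairs a simple keyframe-selection heuristic with the standard sinusoidal positional encoding applied to frame-level video features. The segmentation rule (largest frame-to-frame change per segment), the feature dimensionality, and the random stand-ins for decoded frames and CNN features are assumptions for illustration only, not the authors' implementation.

```python
# Illustrative sketch only: a simple keyframe-selection heuristic plus
# sinusoidal positional encoding of video features. The heuristic and the
# feature dimensions are assumptions, not the authors' implementation.
import numpy as np


def select_keyframes(frames: np.ndarray, num_segments: int) -> np.ndarray:
    """Split the video into equal-length segments and keep, from each segment,
    the frame that changes most from its predecessor (assumed heuristic)."""
    flat = frames.reshape(len(frames), -1)
    diffs = np.abs(np.diff(flat, axis=0)).mean(axis=1)
    diffs = np.concatenate([[0.0], diffs])  # the first frame has no predecessor
    segments = np.array_split(np.arange(len(frames)), num_segments)
    keyframe_ids = [seg[np.argmax(diffs[seg])] for seg in segments if len(seg)]
    return frames[np.array(keyframe_ids)]


def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding (Vaswani et al., 2017), applied
    here to keyframe positions instead of token positions."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe


# Toy usage: 120 dummy frames -> 16 keyframes -> 1024-d features + encoding.
frames = np.random.rand(120, 32, 32, 3)          # stand-in for decoded frames
keyframes = select_keyframes(frames, num_segments=16)
features = np.random.rand(len(keyframes), 1024)  # stand-in for CNN features
video_inputs = features + positional_encoding(len(keyframes), 1024)
```

In a video-guided MT setup such as the one described above, the position-encoded video features would then be consumed by the translation model alongside the source sentence.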
Related papers
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
- Applying Automated Machine Translation to Educational Video Courses [0.0]
We studied the capability of automated machine translation in the online video education space.
We applied text-to-speech synthesis and audio/video synchronization to build engaging videos in target languages.
arXiv Detail & Related papers (2023-01-09T01:44:29Z)
- The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task [92.5087402621697]
This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task.
The YiTrans system is built on large-scale pre-trained encoder-decoder models.
Our final submissions rank first on English-German and English-Chinese end-to-end systems in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2022-06-12T16:13:01Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- NITS-VC System for VATEX Video Captioning Challenge 2020 [16.628598778804403]
We employ an encoder-decoder-based approach in which the visual features of the video are encoded using a 3D convolutional neural network (C3D).
Our model is able to achieve BLEU scores of 0.20 and 0.22 on public and private test data sets respectively.
arXiv Detail & Related papers (2020-06-07T06:39:56Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
- Unsupervised Multimodal Video-to-Video Translation via Self-Supervised Learning [92.17835753226333]
We propose a novel unsupervised video-to-video translation model.
Our model decomposes the style and the content using the specialized UV-decoder structure.
Our model can produce photo-realistic videos in a multimodal way.
arXiv Detail & Related papers (2020-04-14T13:44:30Z)