Video-Helpful Multimodal Machine Translation
- URL: http://arxiv.org/abs/2310.20201v1
- Date: Tue, 31 Oct 2023 05:51:56 GMT
- Title: Video-Helpful Multimodal Machine Translation
- Authors: Yihang Li, Shuichiro Shimizu, Chenhui Chu, Sadao Kurohashi, Wei Li
- Abstract summary: Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles.
We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) and 520k Chinese-English (Zh-En) parallel subtitle pairs with corresponding video clips.
We propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation.
- Score: 36.9686296461948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multimodal machine translation (MMT) datasets consist of images and
video captions or instructional video subtitles, which rarely contain
linguistic ambiguity, making visual information ineffective in generating
appropriate translations. Recent work has constructed an ambiguous subtitles
dataset to alleviate this problem, but it remains limited in that its videos
do not necessarily contribute to disambiguation. We introduce EVA
(Extensive training set and Video-helpful evaluation set for Ambiguous
subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En)
parallel subtitle pairs, 520k Chinese-English (Zh-En) parallel subtitle pairs,
and corresponding video clips collected from movies and TV episodes. In
addition to the extensive training set, EVA contains a video-helpful evaluation
set in which subtitles are ambiguous, and videos are guaranteed helpful for
disambiguation. Furthermore, we propose SAFA, an MMT model based on the
Selective Attention model with two novel methods: Frame attention loss and
Ambiguity augmentation, aiming to make full use of the videos in EVA for disambiguation.
Experiments on EVA show that visual information and the proposed methods can
boost translation performance, and our model performs significantly better than
existing MMT models. The EVA dataset and the SAFA model are available at:
https://github.com/ku-nlp/video-helpful-MMT.git.
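To make the model description above more concrete, the following is a minimal sketch of a selective-attention-style text-video fusion module with an auxiliary frame-attention objective. It is an illustration under assumptions rather than the authors' SAFA implementation: the module and function names, tensor shapes, gating formulation, and the entropy-based frame_attention_loss are hypothetical stand-ins for the components named in the abstract.

```python
import torch
import torch.nn as nn

class SelectiveVideoFusion(nn.Module):
    """Illustrative text-video fusion: subtitle token states attend over video
    frame features, and a learned gate decides how much visual context to mix in."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_states, frame_feats):
        # text_states: (batch, src_len, d_model) subtitle encoder outputs
        # frame_feats: (batch, n_frames, d_model) projected video frame features
        attended, attn = self.cross_attn(text_states, frame_feats, frame_feats)
        gate = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        fused = (1.0 - gate) * text_states + gate * attended
        return fused, attn  # attn: (batch, src_len, n_frames), averaged over heads

def frame_attention_loss(attn, eps: float = 1e-8):
    # Hypothetical auxiliary objective (not necessarily the paper's definition):
    # penalize near-uniform frame attention so the model is encouraged to focus
    # on the frames that actually carry disambiguating information.
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (batch, src_len)
    return entropy.mean()

# Usage with random tensors (all shapes are illustrative):
fusion = SelectiveVideoFusion()
text = torch.randn(2, 20, 512)    # 2 subtitles, 20 tokens each
frames = torch.randn(2, 16, 512)  # 16 sampled frames per clip
fused, attn = fusion(text, frames)
aux_loss = frame_attention_loss(attn)
```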
Related papers
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z) - HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take into account longer subtitle text, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z) - BigVideo: A Large-scale Video Subtitle Translation Dataset for
Multimodal Machine Translation [50.22200540985927]
We present a large-scale video subtitle translation dataset, BigVideo, to facilitate the study of multi-modality machine translation.
BigVideo is more than 10 times larger than existing video translation datasets, consisting of 4.5 million sentence pairs and 9,981 hours of videos.
To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder (see the generic sketch after this list).
arXiv Detail & Related papers (2023-05-23T08:53:36Z) - VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine
Translation [24.99480715551902]
Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity.
We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips.
arXiv Detail & Related papers (2022-01-20T08:38:31Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
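As a concrete reading of the contrastive learning method mentioned in the BigVideo entry above, a generic InfoNCE-style cross-modal objective can be sketched as follows; the pooling into single embeddings, the temperature value, and the symmetric formulation are assumptions, not that paper's code.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, video_emb, temperature: float = 0.07):
    # text_emb, video_emb: (batch, d) pooled subtitle / video-clip embeddings.
    # Matched pairs share a batch index; every other pairing acts as a negative.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = t @ v.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(t.size(0), device=t.device)
    # Symmetric InfoNCE over both retrieval directions (text-to-video, video-to-text).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with random embeddings (dimensions are illustrative):
loss = cross_modal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```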
This list is automatically generated from the titles and abstracts of the papers in this site.