BigVideo: A Large-scale Video Subtitle Translation Dataset for
Multimodal Machine Translation
- URL: http://arxiv.org/abs/2305.18326v3
- Date: Mon, 3 Jul 2023 08:10:10 GMT
- Title: BigVideo: A Large-scale Video Subtitle Translation Dataset for
Multimodal Machine Translation
- Authors: Liyan Kang, Luyang Huang, Ningxin Peng, Peihao Zhu, Zewei Sun, Shanbo
Cheng, Mingxuan Wang, Degen Huang and Jinsong Su
- Abstract summary: We present a large-scale video subtitle translation dataset, BigVideo, to facilitate the study of multi-modality machine translation.
BigVideo is more than 10 times larger than the widely used How2 and VaTeX datasets, consisting of 4.5 million sentence pairs and 9,981 hours of videos.
To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder.
- Score: 50.22200540985927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a large-scale video subtitle translation dataset, BigVideo, to
facilitate the study of multi-modality machine translation. Compared with the
widely used How2 and VaTeX datasets, BigVideo is more than 10 times larger,
consisting of 4.5 million sentence pairs and 9,981 hours of videos. We also
introduce two deliberately designed test sets to verify the necessity of visual
information: Ambiguous, in which ambiguous words are present, and Unambiguous, in
which the text context is self-contained for translation. To better model the
common semantics shared across texts and videos, we introduce a contrastive
learning method in the cross-modal encoder. Extensive experiments on
BigVideo show that: a) Visual information consistently improves the NMT model
in terms of BLEU, BLEURT, and COMET on both Ambiguous and Unambiguous test
sets. b) Visual information helps disambiguation, compared to the strong text
baseline on terminology-targeted scores and human evaluation. Dataset and our
implementations are available at https://github.com/DeepLearnXMU/BigVideo-VMT.
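The contrastive objective described in the abstract aligns the shared semantics of a subtitle and its paired video clip inside the cross-modal encoder. A minimal sketch of one common way to realize such an objective, a symmetric in-batch InfoNCE loss over pooled text and video embeddings, is shown below; the function name, pooling assumption, and temperature value are illustrative and not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                 video_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """text_emb, video_emb: (batch, dim) pooled representations of a
    subtitle and its paired video clip (names are illustrative)."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    # Similarity matrix: entry (i, j) compares subtitle i with video clip j.
    logits = text_emb @ video_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched subtitle/video pairs are positives; other in-batch pairs are negatives.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)
```

In training setups of this kind, such a term is typically combined with the translation loss through a weighting hyperparameter.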
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework that produces annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Video-Helpful Multimodal Machine Translation [36.9686296461948]
Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles.
We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) and 520k Chinese-English (Zh-En) parallel subtitle pairs.
We propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation.
arXiv Detail & Related papers (2023-10-31T05:51:56Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- MuMUR: Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose MuMUR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
arXiv Detail & Related papers (2022-08-24T13:55:15Z)
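As a rough illustration of the first step in the MuMUR summary above, the sketch below expands English video-text pairs into pseudo ground-truth pairs in additional languages via a machine-translation callable. `translate`, the field names, and the data layout are placeholders assumed for illustration, not MuMUR's actual interface.

```python
from typing import Callable, Dict, List

def build_multilingual_pairs(
    en_pairs: List[Dict],                  # each item: {"video_id": ..., "caption_en": ...}
    translate: Callable[[str, str], str],  # (english_text, target_lang) -> translated text
    target_langs: List[str],
) -> List[Dict]:
    """Expand English video-caption pairs into pseudo-labelled pairs in each
    target language, keeping the same video as the visual anchor."""
    expanded = []
    for pair in en_pairs:
        # Keep the original English pair.
        expanded.append({"video_id": pair["video_id"],
                         "caption": pair["caption_en"],
                         "lang": "en"})
        # Add one machine-translated pseudo pair per target language.
        for lang in target_langs:
            expanded.append({"video_id": pair["video_id"],
                             "caption": translate(pair["caption_en"], lang),
                             "lang": lang})
    return expanded
```

The resulting pairs can then be used to train a joint vision-text embedding in which English and non-English queries share a common space.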
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)