VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine
Translation
- URL: http://arxiv.org/abs/2201.08054v2
- Date: Fri, 21 Jan 2022 06:54:02 GMT
- Title: VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine
Translation
- Authors: Yihang Li, Shuichiro Shimizu, Weiqi Gu, Chenhui Chu, Sadao Kurohashi
- Abstract summary: Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity.
We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding video clips.
- Score: 24.99480715551902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multimodal machine translation (MMT) datasets consist of images and
video captions or general subtitles, which rarely contain linguistic ambiguity,
making visual information less effective for generating appropriate
translations. We introduce VISA, a new dataset that consists of 40k
Japanese-English parallel sentence pairs and corresponding video clips with the
following key features: (1) the parallel sentences are subtitles from movies
and TV episodes; (2) the source subtitles are ambiguous, which means they have
multiple possible translations with different meanings; (3) we divide the
dataset into Polysemy and Omission according to the cause of ambiguity. We show
that VISA is challenging for the latest MMT system, and we hope that the
dataset can facilitate MMT research.
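To make the dataset's structure concrete, here is a minimal, hypothetical sketch of how a single VISA example could be represented; the field names and types below are illustrative assumptions, not the dataset's published schema.

    # Hypothetical sketch of one VISA example; field names are illustrative
    # assumptions, not the dataset's actual release format.
    from dataclasses import dataclass
    from enum import Enum

    class AmbiguityType(Enum):
        POLYSEMY = "polysemy"   # the source wording has multiple senses
        OMISSION = "omission"   # the source subtitle omits disambiguating information

    @dataclass
    class VisaExample:
        ja_subtitle: str        # ambiguous Japanese source subtitle
        en_subtitle: str        # reference English translation
        video_clip: str         # path or ID of the corresponding video clip
        ambiguity: AmbiguityType

    example = VisaExample(
        ja_subtitle="...",      # an ambiguous subtitle line from a movie or TV episode
        en_subtitle="...",      # one of its possible translations
        video_clip="clip_00001.mp4",
        ambiguity=AmbiguityType.POLYSEMY,
    )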
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z) - 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z) - Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets [3.54128607634285]
We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can be substituted.
arXiv Detail & Related papers (2024-04-09T08:19:10Z) - Video-Helpful Multimodal Machine Translation [36.9686296461948]
Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles.
We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) parallel subtitle pairs and 520k Chinese-English (Zh-En) parallel subtitle pairs.
We propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation.
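SAFA's exact architecture is not described in this summary; as a rough illustration of the selective-attention idea it builds on, the PyTorch sketch below lets subtitle encoder states attend over video frame features and gates the visual context back into the text representation. Module names, dimensions, and the gating scheme are assumptions; the frame attention loss and ambiguity augmentation are not shown.

    # Minimal sketch of text-conditioned selective attention over video frames
    # (PyTorch); a generic illustration, not the SAFA model itself.
    import torch
    import torch.nn as nn

    class SelectiveAttentionFusion(nn.Module):
        def __init__(self, d_model: int = 512):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)

        def forward(self, text_states, frame_feats):
            # text_states: (batch, src_len, d_model) subtitle encoder states
            # frame_feats: (batch, n_frames, d_model) sampled video frame features
            visual_ctx, _ = self.attn(text_states, frame_feats, frame_feats)
            # Gate how much visual context each subtitle token absorbs.
            g = torch.sigmoid(self.gate(torch.cat([text_states, visual_ctx], dim=-1)))
            return text_states + g * visual_ctx

    fusion = SelectiveAttentionFusion()
    fused = fusion(torch.randn(2, 10, 512), torch.randn(2, 8, 512))  # (2, 10, 512)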
arXiv Detail & Related papers (2023-10-31T05:51:56Z) - BigVideo: A Large-scale Video Subtitle Translation Dataset for
Multimodal Machine Translation [50.22200540985927]
We present a large-scale video subtitle translation dataset, BigVideo, to facilitate the study of multi-modality machine translation.
BigVideo is more than 10 times larger than previously available video subtitle translation datasets, consisting of 4.5 million sentence pairs and 9,981 hours of videos.
To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder.
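The paper's exact objective is not reproduced in this summary; the sketch below shows a standard InfoNCE-style contrastive loss that pulls paired (pooled) text and video embeddings together and pushes mismatched pairs apart, one common way such cross-modal alignment is implemented. The function name and temperature value are assumptions.

    # Generic InfoNCE-style contrastive loss over pooled text/video embeddings;
    # an illustration of the technique, not BigVideo's actual implementation.
    import torch
    import torch.nn.functional as F

    def text_video_contrastive_loss(text_emb, video_emb, temperature=0.07):
        # text_emb, video_emb: (batch, dim); matching pairs share a row index.
        text_emb = F.normalize(text_emb, dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)
        logits = text_emb @ video_emb.t() / temperature        # (batch, batch)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric: text-to-video and video-to-text directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    loss = text_video_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))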
arXiv Detail & Related papers (2023-05-23T08:53:36Z) - Beyond Triplet: Leveraging the Most Data for Multimodal Machine
Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
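VTLM's full formulation also conditions on image region features, which this summary does not detail; the sketch below only illustrates translation-language-modeling-style masking over a concatenated bilingual sentence pair, the textual part such an objective would train on. Token values, the mask rate, and helper names are assumptions.

    # Sketch of TLM-style masking over a concatenated bilingual pair; an
    # illustrative assumption about the text side of a VTLM-like objective.
    import random

    MASK_TOKEN = "[MASK]"

    def mask_bilingual_pair(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
        """Concatenate source and target tokens, randomly mask a fraction,
        and return the corrupted sequence plus the labels to recover."""
        rng = random.Random(seed)
        tokens = src_tokens + ["[SEP]"] + tgt_tokens
        corrupted, labels = [], []
        for tok in tokens:
            if tok != "[SEP]" and rng.random() < mask_prob:
                corrupted.append(MASK_TOKEN)
                labels.append(tok)      # the model must predict this token
            else:
                corrupted.append(tok)
                labels.append(None)     # ignored by the loss
        return corrupted, labels

    corrupted, labels = mask_bilingual_pair(
        ["a", "dog", "runs"], ["ein", "Hund", "rennt"])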
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - MultiSubs: A Large-scale Multimodal and Multilingual Dataset [32.48454703822847]
This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language.
The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles.
We show the utility of the dataset on two automatic tasks: (i) fill-in-the-blank; (ii) lexical translation.
arXiv Detail & Related papers (2021-03-02T18:09:07Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)