Visual Grounding in Video for Unsupervised Word Translation
- URL: http://arxiv.org/abs/2003.05078v2
- Date: Thu, 26 Mar 2020 15:20:44 GMT
- Title: Visual Grounding in Video for Unsupervised Word Translation
- Authors: Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas
Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman
- Abstract summary: We use visual grounding to improve unsupervised word mapping between languages.
We learn embeddings from unpaired instructional videos narrated in the native language.
We apply these methods to translate words from English to French, Korean, and Japanese.
- Score: 91.47607488740647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are thousands of actively spoken languages on Earth, but a single
visual world. Grounding in this visual world has the potential to bridge the
gap between all these languages. Our goal is to use visual grounding to improve
unsupervised word mapping between languages. The key idea is to establish a
common visual representation between two languages by learning embeddings from
unpaired instructional videos narrated in the native language. Given this
shared embedding we demonstrate that (i) we can map words between the
languages, particularly the 'visual' words; (ii) that the shared embedding
provides a good initialization for existing unsupervised text-based word
translation techniques, forming the basis for our proposed hybrid visual-text
mapping algorithm, MUVE; and (iii) our approach achieves superior performance
by addressing the shortcomings of text-based methods -- it is more robust,
handles datasets with less commonality, and is applicable to low-resource
languages. We apply these methods to translate words from English to French,
Korean, and Japanese -- all without any parallel corpora and simply by watching
many videos of people speaking while doing things.
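The retrieval step implied by the shared embedding can be illustrated with a minimal sketch (not the authors' code): assuming the video model has already embedded each language's vocabulary into the common visual space, a source word is translated by cosine nearest-neighbour lookup among target-language words. The arrays and vocabulary below are hypothetical placeholders.

```python
import numpy as np

def translate_words(src_vecs, tgt_vecs, tgt_vocab, k=5):
    """Nearest-neighbour word translation in a shared (visually grounded) space.

    src_vecs: (n_src, d) array of source-language word embeddings
    tgt_vecs: (n_tgt, d) array of target-language word embeddings
    tgt_vocab: list of the n_tgt target-language words
    Returns, for each source word, its top-k candidate translations.
    """
    # L2-normalise so that dot products are cosine similarities.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                        # (n_src, n_tgt) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]  # best-scoring target words per source word
    return [[tgt_vocab[j] for j in row] for row in top_k]

# Hypothetical toy usage: 3 English words, 4 French words, an 8-dim shared space.
rng = np.random.default_rng(0)
en_vecs = rng.normal(size=(3, 8))
fr_vecs = rng.normal(size=(4, 8))
fr_vocab = ["chien", "chat", "eau", "manger"]
print(translate_words(en_vecs, fr_vecs, fr_vocab, k=2))
```

In the paper this visually grounded mapping is additionally used to initialise text-based refinement (the proposed MUVE algorithm); that stage is not shown here.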
Related papers
- Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment [11.148099070407431]
GroundedBERT is a grounded language learning method that enhances the BERT representation with visually grounded information.
Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
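The summary does not say how the visual information is injected, so the following is only a generic illustration of fusing a pooled visual feature with token representations; the module and dimensions are hypothetical, not the GroundedBERT architecture.

```python
import torch
import torch.nn as nn

class VisuallyAugmentedEncoder(nn.Module):
    """Toy fusion of token representations with a visual grounding vector.

    Illustrates the general 'augment a language representation with visual
    features' idea only; the actual GroundedBERT design may differ.
    """

    def __init__(self, text_dim=768, visual_dim=512, out_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        self.fuse = nn.Linear(2 * text_dim, out_dim)

    def forward(self, token_states, visual_feat):
        # token_states: (batch, seq_len, text_dim) from a language model
        # visual_feat:  (batch, visual_dim) pooled image/video feature
        v = self.visual_proj(visual_feat).unsqueeze(1)          # (batch, 1, text_dim)
        v = v.expand(-1, token_states.size(1), -1)              # broadcast over tokens
        return self.fuse(torch.cat([token_states, v], dim=-1))  # (batch, seq_len, out_dim)

# Hypothetical usage with random tensors standing in for real features.
enc = VisuallyAugmentedEncoder()
tokens = torch.randn(2, 10, 768)
image = torch.randn(2, 512)
print(enc(tokens, image).shape)  # torch.Size([2, 10, 768])
```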
arXiv Detail & Related papers (2023-12-04T03:16:48Z)
- Visual Grounding of Inter-lingual Word-Embeddings [6.136487946258519]
The present study investigates the inter-lingual visual grounding of word embeddings.
We focus on three languages in our experiments, namely, English, Arabic, and German.
Our experiments suggest that inter-lingual knowledge improves the performance of grounded embeddings in similar languages.
arXiv Detail & Related papers (2022-09-08T11:18:39Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
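As a rough illustration of the teacher-to-student transfer described above, here is a standard soft-label distillation loss; VidLanKD's actual distillation objectives may differ from this generic sketch.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic soft-label knowledge-distillation loss (Hinton-style).

    Shown only to illustrate the teacher -> student transfer idea; the
    objectives used in VidLanKD itself may differ.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is conventional.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Hypothetical usage: the teacher saw video+text, the student sees text only,
# but both produce logits over the same vocabulary for the same input text.
teacher_logits = torch.randn(4, 30522)
student_logits = torch.randn(4, 30522)
print(distillation_loss(student_logits, teacher_logits).item())
```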
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Globetrotter: Unsupervised Multilingual Translation from Visual Alignment [24.44204156935044]
We introduce a framework that uses the visual modality to align multiple languages.
We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations.
Our language representations are trained jointly in one model with a single stage.
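A minimal sketch of estimating cross-modal alignment with a symmetric contrastive loss is shown below; this is a common choice for aligning matched image/text pairs and is not necessarily Globetrotter's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning matched text/image pairs.

    A common way to estimate cross-modal alignment; not necessarily the
    exact objective used by Globetrotter.
    """
    text = F.normalize(text_emb, dim=-1)
    image = F.normalize(image_emb, dim=-1)
    logits = text @ image.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(text.size(0))      # the i-th text matches the i-th image
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage with random embeddings for a batch of 8 matched pairs.
print(contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```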
arXiv Detail & Related papers (2020-12-08T18:50:40Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
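One standard way to learn such a sentence-to-video mapping without annotations is a max-margin ranking loss over a joint embedding space; the sketch below is an assumption-laden illustration, not the paper's training objective.

```python
import torch
import torch.nn.functional as F

def max_margin_video_text_loss(sent_emb, clip_emb, margin=0.2):
    """Pairwise max-margin ranking loss between sentences and video clips.

    Illustrates the general idea of learning a joint sentence/video space
    from weakly paired data; the paper's actual objective may differ.
    """
    s = F.normalize(sent_emb, dim=-1)
    v = F.normalize(clip_emb, dim=-1)
    sims = s @ v.t()                                 # (batch, batch) similarities
    pos = sims.diag().unsqueeze(1)                   # scores of the matched pairs
    # Penalise any mismatched pair that scores within `margin` of its positive.
    cost_s = (margin + sims - pos).clamp(min=0)      # sentence -> wrong clip
    cost_v = (margin + sims - pos.t()).clamp(min=0)  # clip -> wrong sentence
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    return cost_s.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

# Hypothetical usage with a random batch of 8 sentence/clip pairs.
print(max_margin_video_text_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
```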
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
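A toy version of the "translate and reconstruct simultaneously" idea is sketched below: a shared LSTM encoder feeds two teacher-forced decoders whose losses are summed. Sizes and structure are hypothetical and this is not the paper's implementation.

```python
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    """Toy LSTM encoder with two teacher-forced LSTM decoders: one translates
    the input sentence, the other reconstructs it."""

    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.trans_decoder = nn.LSTM(dim, dim, batch_first=True)
        self.recon_decoder = nn.LSTM(dim, dim, batch_first=True)
        self.trans_out = nn.Linear(dim, tgt_vocab)
        self.recon_out = nn.Linear(dim, src_vocab)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, src, tgt):
        # Encode the source sentence; the final (h, c) state seeds both decoders.
        _, state = self.encoder(self.src_embed(src))
        # Teacher forcing: feed the gold previous token, predict the next one.
        trans_h, _ = self.trans_decoder(self.tgt_embed(tgt[:, :-1]), state)
        recon_h, _ = self.recon_decoder(self.src_embed(src[:, :-1]), state)
        translation_loss = self.loss(self.trans_out(trans_h).flatten(0, 1),
                                     tgt[:, 1:].flatten())
        reconstruction_loss = self.loss(self.recon_out(recon_h).flatten(0, 1),
                                        src[:, 1:].flatten())
        return translation_loss + reconstruction_loss

# Hypothetical usage: a batch of 2 parallel sentences of length 8.
model = TranslateAndReconstruct()
src = torch.randint(0, 1000, (2, 8))
tgt = torch.randint(0, 1000, (2, 8))
print(model(src, tgt).item())
```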
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
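The token-to-image ("voken") assignment step can be sketched as a nearest-neighbour retrieval over image features; the real vokenizer is a learned contextual matching model, so this is only a simplified illustration with hypothetical inputs.

```python
import torch
import torch.nn.functional as F

def assign_vokens(token_embs, image_embs):
    """Assign each contextual token embedding its nearest image ('voken').

    A schematic version of the retrieval step described above; plain cosine
    nearest-neighbour here stands in for the learned token-image matcher.
    """
    tokens = F.normalize(token_embs, dim=-1)   # (seq_len, d)
    images = F.normalize(image_embs, dim=-1)   # (num_images, d)
    scores = tokens @ images.t()               # (seq_len, num_images)
    return scores.argmax(dim=-1)               # index of the chosen voken per token

# Hypothetical usage: 12 tokens, a pool of 100 candidate images, 64-dim features.
voken_ids = assign_vokens(torch.randn(12, 64), torch.randn(100, 64))
print(voken_ids.tolist())
# The resulting voken ids can then serve as labels for an auxiliary
# voken-classification loss alongside the usual language-modelling loss.
```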
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.