Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision
- URL: http://arxiv.org/abs/2010.06775v1
- Date: Wed, 14 Oct 2020 02:11:51 GMT
- Title: Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision
- Authors: Hao Tan, Mohit Bansal
- Abstract summary: We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
- Score: 110.66085917826648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans learn language by listening, speaking, writing, reading, and also, via
interaction with the multimodal real world. Existing language pre-training
frameworks show the effectiveness of text-only self-supervision while we
explore the idea of a visually-supervised language model in this paper. We find
that the main reason hindering this exploration is the large divergence in
magnitude and distributions between the visually-grounded language datasets and
pure-language corpora. Therefore, we develop a technique named "vokenization"
that extrapolates multimodal alignments to language-only data by contextually
mapping language tokens to their related images (which we call "vokens"). The
"vokenizer" is trained on relatively small image captioning datasets and we
then apply it to generate vokens for large language corpora. Trained with these
contextually generated vokens, our visually-supervised language models show
consistent improvements over self-supervised alternatives on multiple
pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models are
publicly available at https://github.com/airsplay/vokenization
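The recipe can be pictured with a short sketch. The code below is an illustration under stated assumptions, not the released airsplay/vokenization implementation: it assumes some contextual text encoder and image encoder are already available, and the names `vokenize` and `VokenClassificationHead` are invented for the example. Stage one assigns each token the image (its voken) with the highest contextual token-image relevance score; stage two adds a voken-classification loss next to masked language modeling.

```python
# Minimal sketch of the two vokenization stages (hypothetical names, not the
# released code): (1) retrieve a "voken" (related image) for every token from
# its contextual embedding, (2) supervise the language model with an auxiliary
# voken-classification loss alongside the usual masked-LM loss.
import torch
import torch.nn.functional as F

def vokenize(token_embs, image_embs):
    """token_embs: (seq_len, d) contextual token vectors from a text encoder.
    image_embs: (num_images, d) vectors for a fixed image set (the voken bank).
    Returns, for each token, the index of its highest-scoring image (its voken)."""
    token_embs = F.normalize(token_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    scores = token_embs @ image_embs.T          # (seq_len, num_images)
    return scores.argmax(dim=-1)                # one voken id per token

class VokenClassificationHead(torch.nn.Module):
    """Predicts the pre-computed voken id of each token from the LM's hidden
    states; its cross-entropy loss is added to the masked-LM loss."""
    def __init__(self, hidden_size, num_vokens):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, num_vokens)

    def forward(self, hidden_states, voken_ids):
        logits = self.proj(hidden_states)       # (batch, seq_len, num_vokens)
        return F.cross_entropy(logits.flatten(0, 1), voken_ids.flatten())
```

Because vokens are just class ids over a fixed image set, the visual supervision can be generated once for a large text corpus and reused during language-only pre-training.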
Related papers
- Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling [47.7950860342515]
LexiContrastive Grounding (LCG) is a grounded language learning procedure that leverages visual supervision to improve textual representations.
LCG outperforms standard language-only models in learning efficiency.
It improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.
arXiv Detail & Related papers (2024-03-21T16:52:01Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image content.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
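To make the "discrete visual tokens" in the LaVIT entry above concrete, here is a generic vector-quantization sketch: patch features are snapped to their nearest entry in a learned codebook, giving an id sequence an LLM can read like foreign-language text. This is an assumption-laden illustration of the general idea, not LaVIT's published tokenizer, which additionally selects patches so the token count varies with the image.

```python
# Generic sketch: turn an image's patch features into discrete "visual token"
# ids by nearest-neighbour lookup in a learned codebook (VQ-style).
# Illustrative only, not LaVIT's actual tokenizer.
import torch

def quantize_patches(patch_feats, codebook):
    """patch_feats: (num_patches, d) features from any visual backbone.
    codebook: (vocab_size, d) learned code vectors.
    Returns one token id per patch (index of the nearest codebook entry)."""
    dists = torch.cdist(patch_feats, codebook)   # (num_patches, vocab_size)
    return dists.argmin(dim=-1)                  # discrete visual token ids
```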
- Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages [3.3227703089509304]
We propose a simple yet efficient approach to adapt Vision-Language Pre-training to unseen languages using MPLM.
Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data.
arXiv Detail & Related papers (2023-06-29T08:20:57Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Improving Zero-Shot Multi-Lingual Entity Linking [14.502266106371433]
We consider multilingual entity linking, where a single model is trained to link references to same-language knowledge bases in several languages.
We propose a neural ranker architecture that leverages multilingual transformer representations of text and can be easily applied in a multilingual setting.
We find that using this approach improves recall in several datasets, often matching the in-language performance.
arXiv Detail & Related papers (2021-04-16T12:50:07Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
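The contrastive pre-training task mentioned for InfoXLM above can be pictured as an InfoNCE-style objective over parallel sentences: a sentence and its translation form the positive pair, while the other sentences in the batch act as negatives. The snippet is a hedged generic sketch of that idea, not InfoXLM's exact cross-lingual objective; the function name and temperature value are made up for the example.

```python
# Generic InfoNCE-style sketch of cross-lingual contrastive pre-training:
# a sentence and its translation are a positive pair; other sentences in the
# batch serve as negatives. Illustrative only, not InfoXLM's exact objective.
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_embs, tgt_embs, temperature=0.05):
    """src_embs, tgt_embs: (batch, d) sentence embeddings of parallel pairs;
    row i of src_embs is a translation of row i of tgt_embs."""
    src = F.normalize(src_embs, dim=-1)
    tgt = F.normalize(tgt_embs, dim=-1)
    logits = src @ tgt.T / temperature                    # (batch, batch) similarities
    labels = torch.arange(src.size(0), device=logits.device)  # matching row = positive
    return F.cross_entropy(logits, labels)
```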
- Visual Grounding in Video for Unsupervised Word Translation [91.47607488740647]
We use visual grounding to improve unsupervised word mapping between languages.
We learn embeddings from unpaired instructional videos narrated in the native language.
We apply these methods to translate words from English to French, Korean, and Japanese.
arXiv Detail & Related papers (2020-03-11T02:03:37Z)
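Once words from two languages are embedded in a shared, visually grounded space, the translation step itself can reduce to nearest-neighbour retrieval. The snippet below sketches only that retrieval step, with invented variable names; the paper's actual contribution is learning such embeddings from unpaired narrated instructional videos, which is not shown here.

```python
# Hedged sketch of the retrieval step for unsupervised word translation:
# assuming word vectors of two languages already live in a shared, visually
# grounded space, translate each source word to its nearest target word.
import torch
import torch.nn.functional as F

def translate_by_nearest_neighbor(src_word_embs, tgt_word_embs, tgt_vocab):
    """src_word_embs: (n_src, d), tgt_word_embs: (n_tgt, d) in a shared space.
    tgt_vocab: list of n_tgt target-language words.
    Returns one translation guess per source word."""
    src = F.normalize(src_word_embs, dim=-1)
    tgt = F.normalize(tgt_word_embs, dim=-1)
    nearest = (src @ tgt.T).argmax(dim=-1)       # cosine nearest neighbour
    return [tgt_vocab[i] for i in nearest.tolist()]
```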
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.