VidLanKD: Improving Language Understanding via Video-Distilled Knowledge
Transfer
- URL: http://arxiv.org/abs/2107.02681v1
- Date: Tue, 6 Jul 2021 15:41:32 GMT
- Title: VidLanKD: Improving Language Understanding via Video-Distilled Knowledge
Transfer
- Authors: Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal
- Abstract summary: We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
- Score: 76.3906723777229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since visual perception can give rich information beyond text descriptions
for world understanding, there has been increasing interest in leveraging
visual grounding for language learning. Recently, vokenization has attracted
attention by using the predictions of a text-to-image retrieval model as labels
for language model supervision. Despite its success, the method suffers from
approximation error of using finite image labels and the lack of vocabulary
diversity of a small image-text dataset. To overcome these limitations, we
present VidLanKD, a video-language knowledge distillation method for improving
language understanding. We train a multi-modal teacher model on a video-text
dataset, and then transfer its knowledge to a student language model with a
text dataset. To avoid approximation error, we propose to use different
knowledge distillation objectives. In addition, the use of a large-scale
video-text dataset helps learn diverse and richer vocabularies. In our
experiments, VidLanKD achieves consistent improvements over text-only language
models and vokenization models, on several downstream language understanding
tasks including GLUE, SQuAD, and SWAG. We also demonstrate the improved world
knowledge, physical reasoning, and temporal reasoning capabilities of our model
by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we
present comprehensive ablation studies as well as visualizations of the learned
text-to-video grounding results of our teacher and student language models. Our
code and models are available at: https://github.com/zinengtang/VidLanKD
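As a rough illustration of the teacher-to-student transfer described in the abstract, the sketch below shows a generic feature-based distillation step: a frozen multi-modal teacher's text encoder supervises a student language model that is otherwise trained with masked language modeling on a text-only corpus. This is a minimal sketch under assumed interfaces (teacher_text_encoder, student_lm, and their returned tensors are hypothetical placeholders); it is not the paper's actual objectives, which the authors design specifically to avoid approximation error.
```python
# Minimal sketch, not VidLanKD's exact objectives: generic feature-based
# distillation from a frozen multi-modal teacher's text encoder into a student
# language model trained on a text-only corpus. All names are placeholders.
import torch
import torch.nn.functional as F

def kd_step(teacher_text_encoder, student_lm, token_ids, mlm_labels, alpha=1.0):
    """One training step: student MLM loss plus a hidden-state distillation term."""
    with torch.no_grad():                            # the teacher is frozen during transfer
        t_hidden = teacher_text_encoder(token_ids)   # (B, T, D) teacher features

    # Assume the student returns (hidden states, MLM logits).
    s_hidden, mlm_logits = student_lm(token_ids)     # (B, T, D), (B, T, V)

    # Standard masked language modeling loss on the text corpus.
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,                           # ignore unmasked positions
    )

    # Simple distillation term: match student hidden states to teacher states.
    kd_loss = F.mse_loss(s_hidden, t_hidden)

    return mlm_loss + alpha * kd_loss
```
The weight alpha trades off language modeling against distillation; the specific distillation objectives used by VidLanKD are described in the paper.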
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
Current plain text descriptions and the visual-only focus of existing language-video tasks limit performance on real-world natural language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling [47.7950860342515]
LexiContrastive Grounding (LCG) is a grounded language learning procedure that leverages visual supervision to improve textual representations.
LCG outperforms standard language-only models in learning efficiency.
It improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.
arXiv Detail & Related papers (2024-03-21T16:52:01Z)
- Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment [11.148099070407431]
GroundedBERT is a grounded language learning method that enhances the BERT representation with visually grounded information.
Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
arXiv Detail & Related papers (2023-12-04T03:16:48Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
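For intuition, here is a minimal sketch of the momentum-distillation idea mentioned above: an exponential-moving-average (EMA) copy of the model produces soft pseudo-targets that supervise the online model. The class, method names, and model interface are illustrative assumptions, not ALBEF's actual code.
```python
# Illustrative sketch of momentum distillation with an EMA teacher; the model
# interface (model(images, texts) -> logits) is an assumption, not ALBEF's API.
import copy
import torch
import torch.nn.functional as F

class MomentumDistiller:
    def __init__(self, model, momentum=0.995):
        self.model = model
        self.momentum_model = copy.deepcopy(model)   # EMA copy produces pseudo-targets
        for p in self.momentum_model.parameters():
            p.requires_grad_(False)
        self.m = momentum

    @torch.no_grad()
    def update_momentum_model(self):
        # EMA update: the momentum model slowly tracks the online model.
        for p, p_m in zip(self.model.parameters(), self.momentum_model.parameters()):
            p_m.data.mul_(self.m).add_(p.data, alpha=1.0 - self.m)

    def distill_loss(self, online_logits, images, texts, temperature=1.0):
        # Soft targets from the momentum model smooth noisy labels in web image-text pairs.
        with torch.no_grad():
            targets = F.softmax(self.momentum_model(images, texts) / temperature, dim=-1)
        log_probs = F.log_softmax(online_logits / temperature, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()
```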
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary-learning-based method to learn relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
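A minimal sketch of the voken idea described above: a token-to-image retrieval model assigns each contextualized token the index of its most related image (its "voken"), and the language model is then trained with an auxiliary voken-classification objective alongside MLM. The retrieval interface and voken_head below are hypothetical placeholders, not the authors' implementation.
```python
# Illustrative sketch of voken generation and the auxiliary voken-classification
# loss; encode_tokens and voken_head are assumed/hypothetical components.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_vokens(retrieval_model, token_ids, image_bank_features):
    """Map each contextualized token to the index of its most related image."""
    token_feats = F.normalize(retrieval_model.encode_tokens(token_ids), dim=-1)  # (B, T, D)
    image_feats = F.normalize(image_bank_features, dim=-1)                       # (V, D)
    sims = token_feats @ image_feats.t()                                         # (B, T, V)
    return sims.argmax(dim=-1)                                                   # voken ids

def voken_classification_loss(lm_hidden, voken_ids, voken_head):
    # Auxiliary objective alongside MLM: predict each token's voken id.
    logits = voken_head(lm_hidden)                                               # (B, T, V)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), voken_ids.view(-1))
```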
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)