XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
- URL: http://arxiv.org/abs/2204.07316v1
- Date: Fri, 15 Apr 2022 03:44:00 GMT
- Title: XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
- Authors: Chan-Jan Hsu, Hung-yi Lee and Yu Tsao
- Abstract summary: This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by cross-modal encoders' success in visual-language tasks, but we alter the learning objective to cater to the language-heavy characteristics of NLU.
- Score: 73.24847320536813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models are widely used in natural language understanding
(NLU) tasks, and multimodal transformers have been effective in visual-language
tasks. This study explores distilling visual information from pretrained
multimodal transformers to pretrained language encoders. Our framework is
inspired by cross-modal encoders' success in visual-language tasks, but we
alter the learning objective to cater to the language-heavy characteristics of
NLU. After training with a small number of extra adapting steps and finetuning,
the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on
the General Language Understanding Evaluation (GLUE) benchmark, the Situations
With Adversarial Generations (SWAG) benchmark, and readability benchmarks. We
analyze the
performance of XDBERT on GLUE to show that the improvement is likely visually
grounded.
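To make the adapting objective concrete, here is a minimal PyTorch sketch of one plausible formulation: pull the student BERT's hidden states toward those of a frozen cross-modal teacher while keeping a masked-language-modeling loss. The projection layer, MSE alignment, and loss weight `alpha` are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def distillation_step(bert_hidden, xmodal_hidden, mlm_logits, mlm_labels,
                      proj, alpha=0.5):
    """One adapting step (sketch): align BERT's hidden states with the frozen
    cross-modal encoder's text-side states, plus an ordinary MLM loss.

    bert_hidden:   (batch, seq, d_bert) states from the student BERT
    xmodal_hidden: (batch, seq, d_xm)   states from the frozen teacher
    proj:          nn.Linear(d_bert, d_xm) mapping student to teacher space
    alpha:         assumed interpolation weight between the two losses
    """
    # Feature-alignment (distillation) term: MSE in the teacher's space.
    distill_loss = F.mse_loss(proj(bert_hidden), xmodal_hidden.detach())
    # Language-heavy objective: standard masked language modeling.
    mlm_loss = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                               mlm_labels.reshape(-1), ignore_index=-100)
    return alpha * distill_loss + (1 - alpha) * mlm_loss
```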
Related papers
- MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models [79.0546136194314]
We present a novel instruction tuning recipe to improve the zero-shot task generalization of multimodal large language models.
We evaluate the performance of the proposed approach on 9 unseen datasets across both language and vision modalities.
arXiv Detail & Related papers (2024-11-15T20:09:59Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and a modality mixer (ModaMixer); a sketch of the mixing idea follows.
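As a hedged sketch of what a modality mixer of this kind can look like, the module below lets a sentence-level language embedding gate the channels of a visual feature map; the sigmoid gating and all dimensions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModaMixerSketch(nn.Module):
    """Sketch: a language embedding channel-wise reweights visual features."""
    def __init__(self, vis_channels: int, lang_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(lang_dim, vis_channels),
                                  nn.Sigmoid())

    def forward(self, vis_feat: torch.Tensor, lang_emb: torch.Tensor):
        # vis_feat: (B, C, H, W) visual map; lang_emb: (B, D) sentence embedding
        weights = self.gate(lang_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return vis_feat * weights  # language-conditioned channel gating
```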
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters [16.44174900423759]
We propose a new plug-and-play module, X-adapter, to leverage the aligned visual and textual knowledge learned in pre-trained vision-language models.
Our method can significantly improve the performance on object-color reasoning and natural language understanding tasks.
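A minimal sketch of such a plug-and-play cross-modal adapter, assuming a standard bottleneck design and simple additive fusion of frozen vision-language-model features (both assumptions, not the paper's exact module):

```python
import torch
import torch.nn as nn

class XAdapterSketch(nn.Module):
    """Sketch: bottleneck adapter that injects frozen VLM features into a
    pre-trained language-model layer via a residual path."""
    def __init__(self, d_model: int, d_vl: int, bottleneck: int = 64):
        super().__init__()
        self.vl_proj = nn.Linear(d_vl, d_model)    # map VLM features in
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor, vl_feat: torch.Tensor):
        # hidden: (B, T, d_model) PLM states; vl_feat: (B, T, d_vl) VLM states
        fused = hidden + self.vl_proj(vl_feat)     # assumed additive fusion
        return hidden + self.up(self.act(self.down(fused)))  # residual adapter
```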
arXiv Detail & Related papers (2023-05-12T10:08:46Z)
- Accessible Instruction-Following Agent [0.0]
We introduce UVLN, a novel machine-translation-based instruction augmentation framework for cross-lingual vision-language navigation.
We extend the standard VLN training objectives to a multilingual setting via a cross-lingual language encoder.
Experiments on the Room-Across-Room dataset demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-05-08T23:57:26Z)
- OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network [17.980765138522322]
This work introduces OmDet, a novel language-aware object detection architecture.
Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets.
We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding.
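As a generic illustration of language-as-label-space detection (an assumption in spirit, not OmDet's released code), region features can be scored against text embeddings of class names, so the "visual vocabulary" can grow freely across datasets:

```python
import torch
import torch.nn.functional as F

def label_region_scores(region_feats: torch.Tensor,
                        label_embs: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Sketch: open-vocabulary scoring of detected regions against text labels.

    region_feats: (N, D) pooled features of N candidate regions
    label_embs:   (K, D) text-encoder embeddings of K class names
    """
    region_feats = F.normalize(region_feats, dim=-1)
    label_embs = F.normalize(label_embs, dim=-1)
    return region_feats @ label_embs.t() / temperature  # (N, K) class logits
```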
arXiv Detail & Related papers (2022-09-10T14:25:14Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization [20.572283625521784]
We develop a neural-network-based abstractive multi-document summarization (MDS) model.
We incorporate dependency information into a linguistic-guided attention mechanism.
With the help of linguistic signals, sentence-level relations can be correctly captured.
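One common way to realize such a linguistic-guided attention mechanism, given here only as a hedged sketch rather than the paper's exact formulation, is to bias attention logits toward token pairs linked in the dependency parse:

```python
import torch

def dependency_biased_attention(scores: torch.Tensor,
                                dep_adj: torch.Tensor,
                                bias: float = 1.0) -> torch.Tensor:
    """Sketch: add a dependency-based bias to attention logits before softmax.

    scores:  (B, T, T) raw attention logits
    dep_adj: (B, T, T) 0/1 adjacency matrix from a dependency parser
    bias:    assumed scalar strength of the linguistic signal
    """
    return torch.softmax(scores + bias * dep_adj, dim=-1)
```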
arXiv Detail & Related papers (2021-09-23T08:13:35Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degeneration of predicting masked words conditioned only on context from the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
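A minimal sketch of plugging a cross-attention module into an encoder layer, where tokens in one language attend to a parallel sentence in another (the sizes and residual/norm wiring are assumptions, not VECO's exact design):

```python
import torch
import torch.nn as nn

class CrossLingualAttentionSketch(nn.Module):
    """Sketch: encoder-style cross-attention between parallel sentences, so
    masked words are not predicted from same-language context alone."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src_states: torch.Tensor, tgt_states: torch.Tensor):
        # src_states: (B, Ts, d) one language; tgt_states: (B, Tt, d) its translation
        attended, _ = self.xattn(src_states, tgt_states, tgt_states)
        return self.norm(src_states + attended)  # residual + layer norm
```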
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual texts.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train these pretext tasks to improve the cross-lingual transferability of pre-trained models.
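A contrastive pre-training task of this kind is often implemented as an InfoNCE-style loss over parallel sentence pairs; the sketch below assumes in-batch negatives and cosine similarity, which may differ from the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def cross_lingual_infonce(src_emb: torch.Tensor, tgt_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """Sketch: a sentence and its translation form the positive pair; the
    other sentences in the batch serve as negatives.

    src_emb, tgt_emb: (B, D) sentence embeddings of B parallel pairs
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits, labels)         # match i-th src to i-th tgt
```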
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
- DIET: Lightweight Language Understanding for Dialogue Systems [0.0]
Large-scale pre-trained language models have shown impressive results on language understanding benchmarks like GLUE and SuperGLUE.
We introduce the Dual Intent and Entity Transformer (DIET) architecture, and study the effectiveness of different pre-trained representations on intent and entity prediction.
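As a rough sketch of the dual-objective idea, a shared encoder output can feed both a sentence-level intent head and a token-level entity tagger, trained jointly. The first-token pooling and plain linear heads are assumptions; DIET itself scores against embedded labels rather than using a fixed softmax classifier.

```python
import torch
import torch.nn as nn

class DualIntentEntityHeadSketch(nn.Module):
    """Sketch: joint intent classification and entity tagging over one encoder."""
    def __init__(self, d_model: int, n_intents: int, n_entity_tags: int):
        super().__init__()
        self.intent = nn.Linear(d_model, n_intents)
        self.entity = nn.Linear(d_model, n_entity_tags)

    def forward(self, hidden: torch.Tensor):
        # hidden: (B, T, d) encoder states; first token pools the sentence
        intent_logits = self.intent(hidden[:, 0])   # (B, n_intents)
        entity_logits = self.entity(hidden)         # (B, T, n_entity_tags)
        return intent_logits, entity_logits
```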
arXiv Detail & Related papers (2020-04-21T12:10:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.