ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
- URL: http://arxiv.org/abs/2006.16934v3
- Date: Fri, 19 Mar 2021 05:17:32 GMT
- Title: ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
- Authors: Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
- Abstract summary: ERNIE-ViL tries to build the detailed semantic connections (objects, attributes of objects and relationships between objects) across vision and language.
ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction and Relationship Prediction tasks.
ERNIE-ViL achieves state-of-the-art performance on all these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.
- Score: 38.97228345655337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates
structured knowledge obtained from scene graphs to learn joint representations
of vision-language. ERNIE-ViL tries to build the detailed semantic connections
(objects, attributes of objects and relationships between objects) across
vision and language, which are essential to vision-language cross-modal tasks.
Utilizing scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph
Prediction tasks, i.e., Object Prediction, Attribute Prediction and
Relationship Prediction tasks in the pre-training phase. Specifically, these
prediction tasks are implemented by predicting nodes of different types in the
scene graph parsed from the sentence. Thus, ERNIE-ViL can learn the joint
representations characterizing the alignments of the detailed semantics across
vision and language. After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performance on all these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.
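To make the Scene Graph Prediction idea concrete, the sketch below shows masked node prediction over a caption's scene graph in Python. It is an illustrative toy, not the ERNIE-ViL implementation: the caption, the hand-written scene graph, and the mask_nodes helper are assumptions made for this example (ERNIE-ViL derives scene graphs with an automatic parser and predicts the masked words with its cross-modal Transformer).

```python
# Minimal sketch (not the official ERNIE-ViL code): given a caption and its
# parsed scene graph, select token positions to mask for the three Scene Graph
# Prediction pre-training tasks. The toy scene graph below is hand-written;
# ERNIE-ViL derives it with an off-the-shelf scene graph parser.
import random

caption = "a brown dog chases a white cat on the grass".split()

# Scene graph nodes: object words, attribute words, and relationship words,
# each given as indices into the caption token list.
scene_graph = {
    "objects":       [2, 6, 9],   # dog, cat, grass
    "attributes":    [1, 5],      # brown, white
    "relationships": [3, 7],      # chases, on
}

MASK = "[MASK]"

def mask_nodes(tokens, positions, mask_rate=0.3):
    """Replace a random subset of the given node positions with [MASK];
    the model is then trained to predict the original words."""
    tokens = list(tokens)
    chosen = [p for p in positions if random.random() < mask_rate] or positions[:1]
    targets = {p: tokens[p] for p in chosen}
    for p in chosen:
        tokens[p] = MASK
    return tokens, targets

for task, positions in scene_graph.items():
    masked, targets = mask_nodes(caption, positions)
    print(task, " ".join(masked), "->", targets)
```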
Related papers
- Towards Interpreting Visual Information Processing in Vision-Language Models [24.51408101801313]
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images.
We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM.
arXiv Detail & Related papers (2024-10-09T17:55:02Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and aligns them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process (see the sketch after this entry).
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
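As referenced in the DoCo entry above, the following is a minimal sketch of the kind of contrastive alignment it describes: pairing document-object features from an auxiliary encoder with the features of the LVLM's vision encoder. Everything here (tensor shapes, the temperature, the random stand-in features, the symmetric InfoNCE loss) is an assumption made for illustration, not DoCo's actual objective or code.

```python
# Generic contrastive-alignment sketch (not the DoCo implementation):
# pull each document-object feature toward its paired vision-encoder feature
# and push it away from the other pairs in the batch (InfoNCE style).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(doc_feats, vis_feats, temperature=0.07):
    # doc_feats, vis_feats: (batch, dim) features paired row-by-row
    doc = F.normalize(doc_feats, dim=-1)
    vis = F.normalize(vis_feats, dim=-1)
    logits = doc @ vis.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(doc.size(0))       # i-th doc matches i-th vision feature
    # symmetric cross-entropy over both matching directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random "features" standing in for the two encoders' outputs.
doc_feats = torch.randn(8, 256)
vis_feats = torch.randn(8, 256)
print(contrastive_alignment_loss(doc_feats, vis_feats).item())
```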
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding [47.48443919164377]
A vision-language pre-training framework is proposed to transfer flexibly to 3D vision-language downstream tasks.
In this paper, we investigate three common tasks in semantic 3D scene understanding, and derive key insights into a pre-training model.
Experiments verify the excellent performance of the framework on three 3D vision-language tasks.
arXiv Detail & Related papers (2023-05-18T05:25:40Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships [17.930724926012264]
We introduce a new task that aims at inducing a joint vision-language structure in an unsupervised manner.
Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly.
We propose an automatic alignment procedure to produce coarse structures followed by human refinement to produce high-quality ones.
arXiv Detail & Related papers (2022-03-27T09:51:34Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks [9.462808515258464]
We propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors.
LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process.
We apply and demonstrate the effectiveness of LAT in three Vision-language (V-L) tasks: Counting-VQA, VQA, and Image captioning.
arXiv Detail & Related papers (2020-08-18T16:29:49Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model (see the sketch after this entry).
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
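As referenced in the ORG entry above, the sketch below gives a rough flavor of an object-relational-graph style encoder: detected object features exchange messages through a learned pairwise relation matrix to produce relation-aware representations. The module name, dimensions, and the single attention-style update are assumptions made for illustration, not the paper's architecture or code.

```python
# Rough sketch of an object-relational-graph style encoder (illustrative only):
# object features attend to each other through a learned relation matrix,
# producing relation-aware object representations for a captioning decoder.
import torch
import torch.nn as nn

class ObjectRelationEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, dim) region features from a detector
        q, k = self.query(obj_feats), self.key(obj_feats)
        relation = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        # aggregate neighbours' messages and add them back to each object
        return obj_feats + torch.relu(self.update(relation @ obj_feats))

# Toy usage: 10 detected objects per image, 512-d features.
enc = ObjectRelationEncoder()
out = enc(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```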
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.