VisualSem: A High-quality Knowledge Graph for Vision and Language
- URL: http://arxiv.org/abs/2008.09150v2
- Date: Wed, 20 Oct 2021 10:08:17 GMT
- Title: VisualSem: A High-quality Knowledge Graph for Vision and Language
- Authors: Houda Alberts, Teresa Huang, Yash Deshpande, Yibo Liu, Kyunghyun Cho,
Clara Vania, Iacer Calixto
- Abstract summary: We release VisualSem, a high-quality knowledge graph (KG).
VisualSem includes nodes with multilingual glosses, multiple illustrative images, and visually relevant relations.
We also release a neural multi-modal retrieval model that can use images or sentences as input and retrieve entities in the KG.
- Score: 48.47370435793127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An exciting frontier in natural language understanding (NLU) and generation
(NLG) calls for (vision-and-) language models that can efficiently access
external structured knowledge repositories. However, many existing knowledge
bases only cover limited domains, or suffer from noisy data, and, above all, are
typically hard to integrate into neural language pipelines. To fill this
gap, we release VisualSem: a high-quality knowledge graph (KG) which includes
nodes with multilingual glosses, multiple illustrative images, and visually
relevant relations. We also release a neural multi-modal retrieval model that
can use images or sentences as input to retrieve entities in the KG. This
multi-modal retrieval model can be integrated into any (neural network) model
pipeline. We encourage the research community to use VisualSem for data
augmentation and/or as a source of grounding, among other possible uses.
VisualSem as well as the multi-modal retrieval models are publicly available
and can be downloaded at this URL: https://github.com/iacercalixto/visualsem
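To illustrate how a KG like VisualSem can be queried with sentences, here is a minimal, self-contained sketch of sentence-to-entity retrieval over node glosses. This is not the released retrieval model: the file name "nodes.json", its schema, the LaBSE encoder, and the retrieve helper are assumptions made purely for illustration; the actual data format and trained retrieval models are documented in the repository linked above.

```python
# Minimal sketch of sentence-to-entity retrieval over a VisualSem-style KG.
# NOTE: this is NOT the released retrieval model. The file name "nodes.json",
# its schema, and the LaBSE encoder are assumptions for illustration only.
import json

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical node file: {"<node_id>": {"glosses": ["gloss 1", ...]}, ...}
with open("nodes.json", encoding="utf-8") as f:
    nodes = json.load(f)

node_ids = list(nodes)
gloss_texts = [nodes[n]["glosses"][0] for n in node_ids]  # first gloss per node

# Any multilingual sentence encoder works for this sketch.
encoder = SentenceTransformer("sentence-transformers/LaBSE")
gloss_emb = encoder.encode(gloss_texts, normalize_embeddings=True)  # (num_nodes, dim)

def retrieve(sentence: str, k: int = 5):
    """Return the k node ids whose glosses are most similar to the query sentence."""
    query = encoder.encode([sentence], normalize_embeddings=True)[0]
    scores = gloss_emb @ query              # cosine similarity (embeddings are unit-norm)
    top = np.argsort(-scores)[:k]
    return [(node_ids[i], float(scores[i])) for i in top]

print(retrieve("a large striped wild cat native to Asia"))
```

Because gloss and query embeddings are L2-normalized, the dot product equals cosine similarity, so ranking all nodes reduces to a single matrix-vector product.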
Related papers
- GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding [39.67113788660731]
We introduce a framework for developing Graph-aligned LAnguage Models (GLaM).
We demonstrate that grounding the models in specific graph-based knowledge expands the models' capacity for structure-based reasoning.
arXiv Detail & Related papers (2024-02-09T19:53:29Z)
- Graph Neural Prompting with Large Language Models [32.97391910476073]
Graph Neural Prompting (GNP) is a novel plug-and-play method to assist pre-trained language models in learning beneficial knowledge from knowledge graphs.
Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks.
arXiv Detail & Related papers (2023-09-27T06:33:29Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World [71.52132776748628]
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions.
We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
arXiv Detail & Related papers (2023-08-03T17:59:47Z)
- DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporate either knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z)
- Leveraging Graph-based Cross-modal Information Fusion for Neural Sign Language Translation [46.825957917649795]
Sign Language (SL), as the mother tongue of the deaf community, is a special visual language that most hearing people cannot understand.
We propose a novel neural SLT model with multi-modal feature fusion based on the dynamic graph.
We are the first to introduce graph neural networks for fusing multi-modal information into neural sign language translation models.
arXiv Detail & Related papers (2022-11-01T15:26:22Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- Endowing Language Models with Multimodal Knowledge Graph Representations [47.22480859519051]
We use the recently released VisualSem KG as our external knowledge repository.
We retrieve entities from the KG and use their multimodal representations to improve downstream task performance.
arXiv Detail & Related papers (2022-06-27T10:10:42Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
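As a concrete illustration of the vokenization idea summarized above (map each token to a related image, then use those image ids as extra supervision), here is a toy sketch in PyTorch. All tensors are random placeholders, and the simple nearest-neighbour assignment stands in for the trained contextual token-image matching model, so this shows only the mechanics, not the released vokenizer.

```python
# Toy sketch of vokenization: (1) assign each token a related image ("voken")
# by nearest-neighbour search in a shared embedding space, (2) use the voken
# ids as extra classification targets when training the language model.
# All tensors are random placeholders; the real vokenizer is a trained
# contextual token-image matching model applied to large text corpora.
import torch
import torch.nn.functional as F

hidden, num_images, seq_len = 768, 1000, 16

token_states = torch.randn(seq_len, hidden)                         # contextual token embeddings
image_embs = F.normalize(torch.randn(num_images, hidden), dim=-1)   # precomputed image embeddings

# Step 1 (vokenization): each token -> id of its most similar image.
sims = F.normalize(token_states, dim=-1) @ image_embs.T             # (seq_len, num_images)
vokens = sims.argmax(dim=-1)                                        # (seq_len,) voken ids

# Step 2 (visual supervision): a voken-prediction head adds an auxiliary
# cross-entropy loss alongside the usual masked-language-modelling loss.
voken_head = torch.nn.Linear(hidden, num_images)
voken_loss = F.cross_entropy(voken_head(token_states), vokens)
print(float(voken_loss))
```

In the actual method the voken targets are precomputed once over the training corpus, so the auxiliary loss adds almost no cost at language-model training time.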