Expanding the Vocabulary of BERT for Knowledge Base Construction
- URL: http://arxiv.org/abs/2310.08291v1
- Date: Thu, 12 Oct 2023 12:52:46 GMT
- Title: Expanding the Vocabulary of BERT for Knowledge Base Construction
- Authors: Dong Yang, Xu Wang, Remzi Celebi
- Abstract summary: "Knowledge Base Construction from Pretrained Language Models" challenge was held at International Semantic Web Conference 2023.
Our focus was on Track 1 of the challenge, where the parameters are constrained to a maximum of 1 billion.
We present Vocabulary Expandable BERT for knowledge base construction, which expand the language model's vocabulary while preserving semantic embeddings.
- Score: 6.412048788884728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge base construction entails acquiring structured information to
create a knowledge base of factual and relational data, facilitating question
answering, information retrieval, and semantic understanding. The challenge
called "Knowledge Base Construction from Pretrained Language Models" at
International Semantic Web Conference 2023 defines tasks focused on
constructing a knowledge base using a language model. Our focus was on Track 1 of
the challenge, where the parameters are constrained to a maximum of 1 billion,
and the inclusion of entity descriptions within the prompt is prohibited.
Although the masked language model offers sufficient flexibility to extend
its vocabulary, it is not inherently designed for multi-token prediction. To
address this, we present Vocabulary Expandable BERT for knowledge base
construction, which expands the language model's vocabulary while preserving
semantic embeddings for newly added words. We adopt task-specific
re-pre-training of the masked language model to further enhance it.
Experimental results demonstrate the effectiveness of our approach. Our
framework achieves an F1 score of 0.323 on the hidden test set and 0.362 on the
validation set, both provided by the challenge. Notably, our framework adopts a
lightweight language model (BERT-base, 0.13 billion parameters) and surpasses a
model that directly prompts a large language model (GPT-3, 175 billion
parameters). In addition, Token-Recode achieves performance comparable to
Re-pretrain. This research advances language understanding models by enabling
the direct embedding of multi-token entities, marking a substantial step
forward for link prediction in knowledge graphs and metadata completion in data
management.
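The vocabulary-expansion idea described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example assuming the Hugging Face transformers library; the entity strings, the prompt, and the mean-of-subword initialisation are illustrative assumptions consistent with "preserving semantic embeddings for newly added words", not the authors' exact Token-Recode implementation.

```python
# Hypothetical sketch: add multi-token entity names to BERT's vocabulary and
# initialise their embeddings from the mean of their sub-word embeddings.
# Assumes the Hugging Face `transformers` library; entities and prompt are
# illustrative, not taken from the paper.
import torch
from transformers import BertTokenizer, BertForMaskedLM

model_name = "bert-base-cased"  # lightweight BERT-base model
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Multi-token entity surface forms we want to predict with a single [MASK].
new_entities = ["Semantic Web", "Knowledge Graph"]  # illustrative examples

# Record each entity's original sub-word ids before the vocabulary changes.
subword_ids = {e: tokenizer(e, add_special_tokens=False)["input_ids"]
               for e in new_entities}

# 1) Expand the tokenizer vocabulary and resize the embedding matrix.
tokenizer.add_tokens(new_entities)
model.resize_token_embeddings(len(tokenizer))

# 2) Initialise each new token's input embedding as the mean of the embeddings
#    of its original sub-word pieces, so the new word starts from a
#    semantically meaningful point (output weights are tied to these in BERT).
with torch.no_grad():
    input_emb = model.get_input_embeddings().weight
    for entity in new_entities:
        new_id = tokenizer.convert_tokens_to_ids(entity)
        pieces = torch.tensor(subword_ids[entity])
        input_emb[new_id] = input_emb[pieces].mean(dim=0)

# The expanded model can now fill a single [MASK] with a multi-token entity.
prompt = "ISWC is a conference about the [MASK]."
inputs = tokenizer(prompt, return_tensors="pt")
logits = model(**inputs).logits
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top_id = logits[0, mask_pos].argmax().item()
print(tokenizer.convert_ids_to_tokens(top_id))
```

A subsequent task-specific masked-language-model pass over prompt-shaped training data (the Re-pretrain variant mentioned in the abstract) would then refine these initialised embeddings.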
Related papers
- Pretrained Generative Language Models as General Learning Frameworks for
Sequence-Based Tasks [0.0]
We propose that small pretrained foundational generative language models can be utilized as a general learning framework for sequence-based tasks.
Our proposal overcomes the computational resource, skill set, and timeline challenges associated with training neural networks and language models from scratch.
We demonstrate that 125M, 350M, and 1.3B parameter pretrained foundational language models can be instruction fine-tuned with 10,000-to-1,000,000 instruction examples.
arXiv Detail & Related papers (2024-02-08T12:19:32Z) - LLM2KB: Constructing Knowledge Bases using instruction tuned context
aware Large Language Models [0.8702432681310401]
Our paper proposes LLM2KB, a system for constructing knowledge bases using large language models.
Our best performing model achieved an average F1 score of 0.6185 across 21 relations in the LM-KBC challenge held at the ISWC 2023 conference.
arXiv Detail & Related papers (2023-08-25T07:04:16Z) - Bridging the Gap: Deciphering Tabular Data Using Large Language Model [4.711941969101732]
This research marks the first application of large language models to table-based question answering tasks.
We have architected a distinctive module dedicated to the serialization of tables for seamless integration with expansive language models.
arXiv Detail & Related papers (2023-08-23T03:38:21Z) - Pre-Training to Learn in Context [138.0745138788142]
The ability of in-context learning is not fully exploited because language models are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z) - DeepStruct: Pretraining of Language Models for Structure Prediction [64.84144849119554]
We pretrain language models on a collection of task-agnostic corpora to generate structures from text.
Our structure pretraining enables zero-shot transfer of the learned knowledge that models have about the structure tasks.
We show that a 10B parameter language model transfers non-trivially to most tasks and obtains state-of-the-art performance on 21 of 28 datasets.
arXiv Detail & Related papers (2022-05-21T00:58:22Z) - Pretraining Approaches for Spoken Language Recognition: TalTech
Submission to the OLR 2021 Challenge [0.0]
The paper is based on our submission to the Oriental Language Recognition 2021 Challenge.
For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition.
For the unconstrained task, we relied on both externally available pretrained models as well as external data.
arXiv Detail & Related papers (2022-05-14T15:17:08Z) - ERICA: Improving Entity and Relation Understanding for Pre-trained
Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA in the pre-training phase to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z) - Learning Contextual Representations for Semantic Parsing with
Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z) - Exploiting Structured Knowledge in Text via Graph-Guided Representation
Learning [73.0598186896953]
We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)