FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt
- URL: http://arxiv.org/abs/2308.10173v1
- Date: Sun, 20 Aug 2023 05:58:33 GMT
- Title: FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt
- Authors: Zhixiao Qi, Yijiong Yu, Meiqi Tu, Junyi Tan, Yongfeng Huang
- Abstract summary: We build a large language model for food testing.
In this paper, we propose a method for handling structured knowledge and scanned documents in incremental pre-training.
- Score: 18.7168443402118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, large language models for specific domains are typically
built by fine-tuning a base model. Some models also incorporate external
knowledge bases without any further pre-training, because the base model
already acquired the relevant domain knowledge during its original
pre-training. We build a large language model for food testing. Unlike the
settings above, a significant amount of data in this domain exists only as
scanned images of domain standard documents. In addition, there is a large
amount of structured knowledge that the base model has never been trained on.
We therefore introduce an incremental pre-training step to inject this
knowledge into the large language model. In this paper, we propose a method
for handling structured knowledge and scanned documents during incremental
pre-training. To mitigate machine hallucination, we construct a knowledge
graph that serves as an external knowledge base and supports retrieval for
the large language model. Note that this paper is a technical report on our
pre-release version; specific experimental results will be reported in future
versions.
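Because this is a pre-release technical report, the abstract gives no implementation details. The sketch below shows one plausible way to prepare the two data sources it mentions for incremental pre-training: linearizing knowledge-graph triples into sentences and chunking OCR output from scanned standard documents into training samples. All names (Triple, triple_to_sentence, scanned_pages_to_samples) and the example values are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Triple:
    """One structured fact, e.g. a knowledge-graph triple about food testing."""
    head: str
    relation: str
    tail: str


def triple_to_sentence(t: Triple) -> str:
    """Linearize a triple into a plain sentence so it can be mixed into the
    incremental pre-training corpus (the template here is illustrative)."""
    return f"{t.head} {t.relation} {t.tail}."


def scanned_pages_to_samples(pages: Iterable[str], max_chars: int = 2000) -> List[str]:
    """Chunk OCR text from scanned standard documents into training samples.
    `pages` is assumed to already contain OCR output; a real pipeline would
    also de-hyphenate, correct OCR errors, and deduplicate."""
    samples: List[str] = []
    buf = ""
    for page in pages:
        text = " ".join(page.split())  # collapse whitespace artifacts from OCR
        buf = f"{buf} {text}".strip()
        while len(buf) >= max_chars:
            samples.append(buf[:max_chars])
            buf = buf[max_chars:]
    if buf:
        samples.append(buf)
    return samples


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    fact = Triple("Compound X", "has a maximum residue limit in product Y of", "0.01 mg/kg")
    print(triple_to_sentence(fact))
    print(scanned_pages_to_samples(["Page 1 of a scanned standard ...", "Page 2 ..."]))
```

Under this assumption, the resulting text samples would simply be appended to the incremental pre-training corpus alongside ordinary domain text.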
Related papers
- Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval [31.9252824152673]
We build on previous research that demonstrated loss of information in the middle of input sequences for causal language models.
We examine positional biases at various stages of training for an encoder-decoder model, including language model pre-training, contrastive pre-training, and contrastive fine-tuning.
arXiv Detail & Related papers (2024-04-05T15:16:16Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Adapting Large Language Models to Domains via Reading Comprehension [86.24451681746676]
We explore how continued pre-training on domain-specific corpora influences large language models.
We show that training on the raw corpora endows the model with domain knowledge, but drastically hurts its ability to answer questions.
We propose a simple method for transforming raw corpora into reading comprehension texts.
arXiv Detail & Related papers (2023-09-18T07:17:52Z)
- Large Language Models Struggle to Learn Long-Tail Knowledge [39.01608375863687]
We study the relationship between the knowledge memorized by large language models and the information in pre-training datasets scraped from the web.
In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training.
arXiv Detail & Related papers (2022-11-15T18:49:27Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing existing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [78.8500633981247]
This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning".
Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly.
arXiv Detail & Related papers (2021-07-28T18:09:46Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- REALM: Retrieval-Augmented Language Model Pre-Training [37.3178586179607]
We augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia.
For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner.
We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA).
arXiv Detail & Related papers (2020-02-10T18:40:59Z)
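REALM above and the knowledge-graph prompt in FoodGPT share the same retrieve-then-prompt pattern: fetch external facts relevant to a query and let the model condition on them to reduce hallucination. Below is a minimal, hypothetical sketch that uses naive keyword overlap in place of a real retriever or graph query; retrieve_facts, build_prompt, and the example facts are illustrative names and values, not APIs or data from any of the papers listed.

```python
from typing import List


def retrieve_facts(question: str, facts: List[str], k: int = 3) -> List[str]:
    """Rank candidate facts by naive keyword overlap with the question and
    return the top-k. Real systems use dense retrievers or graph queries."""
    q_tokens = set(question.lower().split())
    scored = sorted(facts, key=lambda f: len(q_tokens & set(f.lower().split())), reverse=True)
    return scored[:k]


def build_prompt(question: str, retrieved: List[str]) -> str:
    """Prepend retrieved facts to the question so the language model can
    ground its answer in external knowledge instead of hallucinating."""
    context = "\n".join(f"- {fact}" for fact in retrieved)
    return f"Known facts:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # Illustrative knowledge base entries, not real regulatory values.
    kb = [
        "Standard X specifies a maximum lead content of 0.1 mg/kg in beverages.",
        "Method Y is an accredited assay for detecting melamine in dairy products.",
        "Laboratory Z must recalibrate its spectrometers every six months.",
    ]
    question = "What is the maximum lead content allowed in beverages?"
    print(build_prompt(question, retrieve_facts(question, kb)))
```

The assembled prompt is what would then be sent to the language model; swapping in a dense retriever (as in REALM) or a knowledge-graph query (as described for FoodGPT) changes only the retrieval step.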