Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences
- URL: http://arxiv.org/abs/2103.10334v1
- Date: Thu, 18 Mar 2021 15:51:04 GMT
- Title: Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences
- Authors: Matthew B. A. McDermott, Brendan Yap, Peter Szolovits, Marinka Zitnik
- Abstract summary: Language model pre-training fails at modeling per-sequence relations in non-natural language domains.
We develop a framework that couples LMPT with deep structure-preserving metric learning to produce richer embeddings.
Our approach offers notable performance improvements on downstream tasks.
- Score: 23.806325599416134
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language model pre-training (LMPT) has achieved remarkable results in natural
language understanding. However, LMPT is much less successful in non-natural
language domains like protein sequences, revealing a crucial discrepancy
between the various sequential domains. Here, we posit that while LMPT can
effectively model per-token relations, it fails at modeling per-sequence
relations in non-natural language domains. To this end, we develop a framework
that couples LMPT with deep structure-preserving metric learning to produce
richer embeddings than can be obtained from LMPT alone. We examine new and
existing pre-training models in this framework and theoretically analyze the
framework overall. We also design experiments on a variety of synthetic
datasets and new graph-augmented datasets of proteins and scientific abstracts.
Our approach offers notable performance improvements on downstream tasks,
including prediction of protein remote homology and classification of citation
intent.
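The abstract describes coupling LMPT with a structure-preserving metric-learning objective but does not spell out the implementation here. Below is a rough, hedged sketch of how such a coupling can be wired up; it is not the authors' code, and the encoder interface, the lambda weight, mean pooling, and the use of pairwise graph distances as the structural signal are all illustrative assumptions.

import torch
import torch.nn.functional as F

def joint_pretraining_loss(encoder, mlm_head, tokens, mlm_labels, graph_dists, lam=0.1):
    # tokens: (B, L) token ids; mlm_labels: (B, L) with -100 at unmasked positions;
    # graph_dists: (B, B) pairwise structural distances between the B sequences.
    hidden = encoder(tokens)                                  # (B, L, D) contextual embeddings
    # Per-token relations: the usual masked-language-modeling cross-entropy.
    logits = mlm_head(hidden)                                 # (B, L, V) vocabulary logits
    mlm_loss = F.cross_entropy(logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    # Per-sequence relations: push embedding distances to track structural distances.
    seq_emb = hidden.mean(dim=1)                              # (B, D) pooled sequence embeddings
    emb_dists = torch.cdist(seq_emb, seq_emb)                 # (B, B) Euclidean distances
    metric_loss = F.mse_loss(emb_dists, graph_dists)
    return mlm_loss + lam * metric_loss

The per-token term keeps what LMPT already does well, while the distance-matching term is one simple stand-in for the structure-preserving objectives the abstract refers to; the framework admits other metric-learning losses in its place.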
Related papers
- Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists of splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
- Evaluating Neural Language Models as Cognitive Models of Language Acquisition [4.779196219827507]
We argue that some of the most prominent benchmarks for evaluating the syntactic capacities of neural language models may not be sufficiently rigorous.
When trained on small-scale data modeling child language acquisition, the LMs can be readily matched by simple baseline models.
We conclude with suggestions for better connecting LMs with the empirical study of child language acquisition.
arXiv Detail & Related papers (2023-10-31T00:16:17Z)
- Improving Language Models Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary [65.268245109828]
Non-human-like behaviour of contemporary pre-trained language models (PLMs) is a leading factor undermining their trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produces contradictory results.
We propose a practical approach that alleviates the inconsistent-behaviour issue by improving PLMs' awareness of conceptual roles learned from a dictionary.
arXiv Detail & Related papers (2023-10-24T06:15:15Z)
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Linguistically inspired roadmap for building biologically reliable protein language models [0.5412332666265471]
We argue that guidance drawn from linguistics can aid with building more interpretable protein LMs.
We provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation.
arXiv Detail & Related papers (2022-07-03T08:42:44Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and they underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge (an illustrative sketch of this kind of word-order ablation appears after this list).
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data [37.542036032277466]
We introduce a technique for "simulation-to-real" transfer in language understanding problems.
Our approach matches or outperforms state-of-the-art models trained on natural language data in several domains.
arXiv Detail & Related papers (2020-04-28T16:41:00Z)
- Logical Natural Language Generation from Open-Domain Tables [107.04385677577862]
We propose a new task in which a model generates natural language statements that are logically entailed by the facts.
To facilitate the study of the proposed logical NLG problem, we use the existing TabFact dataset (Chen et al., 2019), which features a wide range of logical/symbolic inferences.
The new task poses challenges to the existing monotonic generation frameworks due to the mismatch between sequence order and logical order.
arXiv Detail & Related papers (2020-04-22T06:03:10Z)
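The sketch promised under the "Masked Language Modeling and the Distributional Hypothesis" entry above: one simple way to implement a word-order ablation is to permute the tokens inside each training sentence, destroying syntax while leaving the bag of co-occurring words intact. This is an illustrative reconstruction assuming whitespace tokenization, not the study's released code, and the published experiments may use different permutation schemes.

import random

def shuffle_within_sentence(sentence, seed=None):
    # Permute whitespace tokens so word order is destroyed but the
    # sentence-level co-occurrence statistics are preserved.
    rng = random.Random(seed)
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

corpus = ["the cat sat on the mat", "proteins fold into complex structures"]
shuffled_corpus = [shuffle_within_sentence(s, seed=i) for i, s in enumerate(corpus)]
# Pre-training on shuffled_corpus instead of corpus isolates purely
# distributional information from word-order (syntactic) information.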
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.