Related papers: Code Pretraining Improves Entity Tracking Abilities of Language Models

Code Pretraining Improves Entity Tracking Abilities of Language Models

URL: http://arxiv.org/abs/2405.21068v1
Date: Fri, 31 May 2024 17:56:33 GMT
Title: Code Pretraining Improves Entity Tracking Abilities of Language Models
Authors: Najoung Kim, Sebastian Schuster, Shubham Toshniwal,
Abstract summary: We find clear evidence that models additionally trained on large amounts of code outperform the base models. On the other hand, we find no consistent benefit of additional math training or alignment tuning across various model families.
Score: 20.6768931196215
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim by comparing pairs of language models on their entity tracking performance. Critically, the pairs consist of base models and models trained on top of these base models with additional code data. We extend this analysis to additionally examine the effect of math training, another highly structured data type, and alignment tuning, an important step for enhancing the usability of models. We find clear evidence that models additionally trained on large amounts of code outperform the base models. On the other hand, we find no consistent benefit of additional math training or alignment tuning across various model families.

Related papers

KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model [27.25688303240741]
KaLM-Embedding is a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance.
arXiv Detail & Related papers (2025-01-02T03:17:51Z)
Can bidirectional encoder become the ultimate winner for downstream applications of foundation models? [1.8120356834558644]
Foundational models have the characteristics of pre-training, transfer learning, and self-supervised learning. BERT broke through the limitation of only using one-way methods for language modeling in pre-training by using a masked language model. This article analyzes one-way and bidirectional models based on GPT and BERT and compares their differences based on the purpose of the model.
arXiv Detail & Related papers (2024-11-27T03:31:14Z)
Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains. We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining [4.38070902806635]
We set up a benchmark for languages Croatian, Serbian, Bosnian and Montenegrin. We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
arXiv Detail & Related papers (2024-04-08T11:55:44Z)
Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved abilities of instruction following and safe generation. The common practice of using sampling during generation also increases chances of hallucination. We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z)
Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models. We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models. Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z)
On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models. We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling. Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training. Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z)
Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate there is no single model that works best for all the cases. By choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings. We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
Interpreting Language Models Through Knowledge Graph Extraction [42.97929497661778]
We compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process. We present a methodology to unveil a knowledge acquisition timeline by generating knowledge graph extracts from cloze "fill-in-the-blank" statements. We extend this analysis to a comparison of pretrained variations of BERT models (DistilBERT, BERT-base, RoBERTa)
arXiv Detail & Related papers (2021-11-16T15:18:01Z)
How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting. Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking. We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.