Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
- URL: http://arxiv.org/abs/2009.14109v2
- Date: Wed, 30 Sep 2020 15:40:39 GMT
- Title: Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
- Authors: Charles Welch, Rada Mihalcea, Jonathan K. Kummerfeld
- Abstract summary: We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance.
In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.
- Score: 47.08853566241831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many NLP applications, such as biomedical data and technical support, have
10-100 million tokens of in-domain data and limited computational resources for
learning from it. How should we train a language model in this scenario? Most
language modeling research considers either a small dataset with a closed
vocabulary (like the standard 1 million token Penn Treebank), or the whole web
with byte-pair encoding. We show that for our target setting in English,
initialising and freezing input embeddings using in-domain data can improve
language model performance by providing a useful representation of rare words,
and this pattern holds across several different domains. In the process, we
show that the standard convention of tying input and output embeddings does not
improve perplexity when initializing with embeddings trained on in-domain data.
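
The recipe described in the abstract maps onto a few lines of model code. The sketch below is a minimal illustration under stated assumptions (a word-level PyTorch LSTM language model and word vectors pre-trained on the in-domain corpus, e.g. with word2vec); it is not the authors' released implementation, and all class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Word-level LSTM LM with in-domain input-embedding initialisation.

    `pretrained_vectors` is assumed to be a (vocab_size, emb_dim) float
    tensor of word vectors trained on the in-domain corpus.
    """

    def __init__(self, pretrained_vectors, hidden_dim=512, num_layers=2):
        super().__init__()
        vocab_size, emb_dim = pretrained_vectors.shape

        # Initialise the *input* embeddings from in-domain vectors and
        # freeze them, as the abstract describes.
        self.embedding = nn.Embedding.from_pretrained(
            pretrained_vectors, freeze=True)

        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)

        # The output projection is kept *untied* from the input embeddings:
        # with in-domain initialisation, the abstract reports that the
        # usual weight-tying convention does not improve perplexity.
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        emb = self.embedding(token_ids)         # (batch, seq, emb_dim)
        hidden, state = self.lstm(emb, state)   # (batch, seq, hidden_dim)
        return self.output(hidden), state       # logits over the vocabulary
```

For comparison, the standard tying convention would instead set `self.output.weight = self.embedding.weight` (which requires `hidden_dim == emb_dim`); the paper's finding is that this tying is not needed once the input embeddings come from in-domain data.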
Related papers
- Pretraining Data and Tokenizer for Indic LLM [1.7729311045335219]
We develop a novel approach to data preparation for developing multilingual Indic large language model.
Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia.
For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content.
arXiv Detail & Related papers (2024-07-17T11:06:27Z)
- LexGen: Domain-aware Multilingual Lexicon Generation [40.97738267067852]
We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting.
Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique.
We release a new benchmark dataset consisting of over 75K translation pairs across 6 Indian languages spanning 8 diverse domains.
arXiv Detail & Related papers (2024-05-18T07:02:43Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts)
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation [19.624330093598996]
Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data.
By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer.
We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian.
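
A rough sketch of the token-mapping idea, not the paper's actual implementation: copy source-model embedding rows into a new target-language embedding matrix via a word translation dictionary. The function, variable names, and the random fallback for untranslated tokens are all assumptions; the paper additionally handles tokenizer mismatches, which this sketch ignores.

```python
import torch

def init_target_embeddings(source_emb, source_vocab, target_vocab,
                           translation_dict):
    """Initialise target-language embeddings from a source-language model.

    source_emb:       (|source_vocab|, dim) tensor from the source model
    source_vocab:     dict mapping source token -> row index
    target_vocab:     dict mapping target token -> row index
    translation_dict: dict mapping target word -> source word
    (all inputs are hypothetical placeholders for this sketch)
    """
    dim = source_emb.size(1)
    target_emb = torch.empty(len(target_vocab), dim).normal_(std=0.02)

    for token, row in target_vocab.items():
        # Map the target token to a semantically similar source token via
        # the translation dictionary, then copy that token's embedding.
        src_word = translation_dict.get(token)
        if src_word is not None and src_word in source_vocab:
            target_emb[row] = source_emb[source_vocab[src_word]]
        # Tokens without a translation keep their random initialisation.

    return target_emb
```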
arXiv Detail & Related papers (2023-10-05T11:45:29Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation [54.29679610921429]
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
Our simple multi-modality transfer learning baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling [7.310390479801139]
We self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce dialectal varieties.
Our work opens up opportunities for developing dialectal Arabic (DA) models that exploit only Modern Standard Arabic (MSA) resources.
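
Self-training here follows the familiar pseudo-labelling loop. The sketch below is a generic outline of that loop, not the paper's exact procedure; the helper functions, the confidence threshold, and the number of rounds are illustrative assumptions.

```python
def self_train(model, train_fn, predict_fn, labeled_msa, unlabeled_da,
               rounds=3, confidence_threshold=0.9):
    """Generic pseudo-labelling loop for zero-/few-shot sequence labelling.

    train_fn(model, data)  -> model fine-tuned on (tokens, labels) pairs
    predict_fn(model, x)   -> (predicted_labels, confidence) for one example
    labeled_msa            -> list of (tokens, labels) pairs in MSA
    unlabeled_da           -> list of unlabeled dialectal token sequences
    (all names and values are hypothetical placeholders)
    """
    data = list(labeled_msa)
    for _ in range(rounds):
        model = train_fn(model, data)               # fit on the current data
        pseudo = []
        for tokens in unlabeled_da:
            labels, confidence = predict_fn(model, tokens)
            if confidence >= confidence_threshold:  # keep confident predictions
                pseudo.append((tokens, labels))
        data = list(labeled_msa) + pseudo           # augment and repeat
    return model
```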
arXiv Detail & Related papers (2021-01-12T21:29:30Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
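
To illustrate why a compositional output layer decouples model size from the word vocabulary, the sketch below builds each candidate word's output embedding on the fly from its characters. This is only a toy bag-of-characters composition under assumed names (character id 0 as padding); the paper's actual composition function differs.

```python
import torch
import torch.nn as nn

class CompositionalOutputLayer(nn.Module):
    """Output layer whose parameter count does not grow with the vocabulary.

    Word output embeddings are composed from character embeddings, so the
    layer only stores parameters for the character inventory.
    """

    def __init__(self, num_chars, dim):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, dim)  # scales with characters only
        self.proj = nn.Linear(dim, dim)

    def word_embedding(self, char_ids):
        # char_ids: (num_words, max_word_len), padded with 0 (assumed).
        mask = (char_ids != 0).unsqueeze(-1).float()
        chars = self.char_emb(char_ids) * mask
        pooled = chars.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return torch.tanh(self.proj(pooled))          # (num_words, dim)

    def forward(self, hidden, candidate_char_ids):
        # hidden: (batch, dim). Scores any candidate word list, including
        # words never seen during training.
        cand = self.word_embedding(candidate_char_ids)  # (num_words, dim)
        return hidden @ cand.t()                        # (batch, num_words)
```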
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.