Cabrita: closing the gap for foreign languages
- URL: http://arxiv.org/abs/2308.11878v1
- Date: Wed, 23 Aug 2023 02:49:35 GMT
- Title: Cabrita: closing the gap for foreign languages
- Authors: Celio Larcher, Marcos Piau, Paulo Finardi, Pedro Gengo, Piero Esposito, Vinicius Caridá
- Abstract summary: The strategy of training the model from scratch in a specific language or domain serves two essential purposes.
The main solution to overcoming the cost challenge is to rely on available pre-trained models.
We present a methodology named Cabrita, which successfully addresses the performance and efficient tokenization problem.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The strategy of training the model from scratch in a specific language or
domain serves two essential purposes: i) enhancing performance in the
particular linguistic or domain context, and ii) ensuring effective
tokenization. The main limitation inherent to this approach lies in the
associated cost, which can reach six- to seven-digit dollar values, depending on
the model size and the number of parameters involved.
The main solution to overcome the cost challenge is to rely on available
pre-trained models, which, despite recent advancements such as the LLaMA and
LLaMA-2 models, still demonstrate inefficiency for certain specific domain
problems or prove ineffective in scenarios involving conversational memory
resources, given the large number of tokens required to represent text.
To overcome this issue, we present a methodology named Cabrita, which, as our
research demonstrates, successfully addresses the performance and efficient
tokenization problem, all at an affordable cost. We believe that this
methodology can be applied to any transformer-like architecture model. To
validate the study, we conducted continuous pre-training exclusively using
Portuguese text on a 3-billion-parameter model known as OpenLLaMA, resulting in
a model named openCabrita 3B. The openCabrita 3B also features a new tokenizer
that results in a significant reduction in the number of tokens required to
represent the text. In our assessment, for few-shot learning tasks, we achieved
similar results with this 3B model compared to a traditional continuous
pre-training approach as well as to 7B models pre-trained in English.
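
The recipe described above (retrain a tokenizer on the target language, then continue pre-training the base model on that language) can be pictured with a minimal Python sketch. This is an illustration of the general approach, not the paper's actual code: the corpus file, vocabulary size, output directory, and hyperparameters are assumptions, and the use of SentencePiece and Hugging Face Transformers is our own choice.

import sentencepiece as spm
from transformers import (AutoModelForCausalLM, LlamaTokenizer, Trainer,
                          TrainingArguments)

# 1) Train a new SentencePiece tokenizer on a Portuguese corpus
#    (corpus file and vocabulary size are hypothetical).
spm.SentencePieceTrainer.train(
    input="portuguese_corpus.txt",
    model_prefix="pt_tokenizer",
    vocab_size=52000,
    model_type="bpe",
)
tokenizer = LlamaTokenizer(vocab_file="pt_tokenizer.model")

# 2) Load the base model and resize its embedding matrix to the new vocabulary.
model = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b")
model.resize_token_embeddings(len(tokenizer))

# 3) Continue pre-training on Portuguese text; plug in a tokenized dataset
#    and data collator before actually calling trainer.train().
args = TrainingArguments(output_dir="opencabrita-3b-sketch",
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=None)
# trainer.train()

In practice, the embedding rows for a new vocabulary are often initialized from overlapping tokens of the original vocabulary rather than at random, but that detail is omitted from this sketch.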
Related papers
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- $C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding [28.853593305486832]
Cross-lingual natural language understanding (NLU) is a critical task in natural language processing (NLP).
Recent advancements have seen multilingual pre-trained language models (mPLMs) significantly enhance the performance of these tasks.
Existing model cascade methods seek to enhance inference efficiency by greedily selecting the lightest model capable of processing the current input from a variety of models.
arXiv Detail & Related papers (2024-02-25T05:07:56Z)
- PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs).
We show that PanGu-$\pi$-7B can achieve performance comparable to benchmark models with about a 10% inference speed-up.
In addition, we have deployed PanGu-$pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z)
- Tokenizer Choice For LLM Training: Negligible or Crucial? [30.33170936148845]
We study the influence of tokenizer choice on Large Language Models (LLMs) downstream performance by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than English tokenizers (a simple way to measure this kind of tokenizer efficiency is sketched after this list).
arXiv Detail & Related papers (2023-10-12T22:44:19Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- DeBERTinha: A Multistep Approach to Adapt DebertaV3 XSmall for Brazilian Portuguese Natural Language Processing Task [0.3499870393443269]
This paper presents an approach for adapting the DebertaV3 XSmall model pre-trained in English for Brazilian Portuguese natural language processing (NLP) tasks.
A key aspect of the methodology involves a multistep training process to ensure the model is effectively tuned for the Portuguese language.
The adapted model, called DeBERTinha, demonstrates effectiveness on downstream tasks like named entity recognition, sentiment analysis, and determining sentence relatedness.
arXiv Detail & Related papers (2023-09-28T20:53:25Z)
- Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model.
We find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns.
We demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens from the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model.
arXiv Detail & Related papers (2023-05-28T13:19:12Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
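
A recurring theme above, both in the Cabrita abstract and in the tokenizer-choice paper, is tokenizer efficiency across languages. A common way to quantify it is "fertility", the average number of tokens per whitespace-delimited word. Below is a minimal Python sketch of that measurement; the model ID and example sentences are placeholders, not data from any of the papers.

from transformers import AutoTokenizer

# Any causal LM tokenizer works here; the ID below is just an example.
tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")

samples = {
    "en": "Training a model from scratch in a specific language is expensive.",
    "pt": "Treinar um modelo do zero em um idioma específico é caro.",
}

for lang, text in samples.items():
    n_tokens = len(tok.encode(text))
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words "
          f"= fertility {n_tokens / n_words:.2f}")

A lower fertility on the target language is exactly what a retrained tokenizer, such as the one introduced with openCabrita 3B, is meant to achieve.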
This list is automatically generated from the titles and abstracts of the papers on this site.