LACoS-BLOOM: Low-rank Adaptation with Contrastive objective on 8 bits Siamese-BLOOM
- URL: http://arxiv.org/abs/2305.06404v1
- Date: Wed, 10 May 2023 18:26:42 GMT
- Title: LACoS-BLOOM: Low-rank Adaptation with Contrastive objective on 8 bits Siamese-BLOOM
- Authors: Wen-Yu Hua and Brian Williams and Davood Shamsi
- Abstract summary: We present 8-bit Siamese-BLOOM, a multilingual large language model optimized to produce semantically meaningful word embeddings.
We fine-tune BLOOM with a scalable adapter (LoRA) and 8-bit Adam for sentence similarity classification.
Experiment results show the quality of learned embeddings from LACoS-BLOOM is proportional to the number of model parameters and the amount of unlabeled training data.
- Score: 2.9327503320877457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text embeddings are useful features for several NLP applications, such as
sentence similarity, text clustering, and semantic search. In this paper, we
present a Low-rank Adaptation with a Contrastive objective on top of 8-bit
Siamese-BLOOM, a multilingual large language model optimized to produce
semantically meaningful word embeddings. The innovation is threefold. First, we
cast BLOOM weights to 8-bit values. Second, we fine-tune BLOOM with a scalable
adapter (LoRA) and 8-bit Adam optimizer for sentence similarity classification.
Third, we apply a Siamese architecture to the BLOOM model with a contrastive
objective to mitigate the scarcity of multilingual labeled data. The
experimental results show that the quality of the learned embeddings from
LACoS-BLOOM is proportional to the number of model parameters and the amount of
unlabeled training data. With the parameter-efficient fine-tuning design, we are
able to run the 7.1-billion-parameter BLOOM end-to-end on a single GPU machine
with 32 GB of memory. Compared to the previous solution, Sentence-BERT, we
achieve significant improvements on both English and multilingual STS tasks.
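To make the three ingredients above concrete, here is a minimal sketch of how such a setup can be wired together with Hugging Face transformers, peft, and bitsandbytes. The checkpoint name, LoRA hyperparameters, mean pooling, and the cosine-based contrastive loss are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' code): 8-bit BLOOM weights, a LoRA adapter,
# an 8-bit Adam optimizer, and a Siamese mean-pooled encoder trained with a
# cosine-based contrastive loss.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import LoraConfig, get_peft_model
import bitsandbytes as bnb

model_name = "bigscience/bloom-7b1"                      # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1) Cast BLOOM weights to 8-bit values at load time (bitsandbytes int8).
base = AutoModel.from_pretrained(model_name, load_in_8bit=True, device_map="auto")
# (in practice the int8 model is usually also prepared for training, e.g. with
#  peft's prepare_model_for_kbit_training)

# 2) Attach a small LoRA adapter; only its low-rank matrices receive gradients.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["query_key_value"])  # assumed hyperparameters
model = get_peft_model(base, lora_cfg)

# 3) 8-bit Adam keeps optimizer statistics in 8 bits, saving further memory.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

def embed(sentences):
    """Siamese tower: encode a batch and mean-pool the last hidden states."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt").to(base.device)
    hidden = model(**batch).last_hidden_state               # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # masked mean pooling

def contrastive_step(sent_a, sent_b, labels, margin=0.5):
    """One update with a simple cosine contrastive loss (labels: 1 similar, 0 not)."""
    za, zb = embed(sent_a), embed(sent_b)
    cos = F.cosine_similarity(za, zb)
    loss = (labels * (1 - cos) + (1 - labels) * F.relu(cos - margin)).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Only the LoRA parameters are trained; together with the 8-bit weights and 8-bit optimizer states, this is what keeps a 7.1-billion-parameter model within a single 32 GB GPU as reported in the abstract.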
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Advanced Natural-based interaction for the ITAlian language: LLaMAntino-3-ANITA [3.195234044113248]
We introduce a Large Language Model (LLM) based on the novel Meta LLaMA-3 model: LLaMAntino-3-ANITA-8B-Inst-DPO-ITA.
We fine-tuned the original 8B-parameter instruction-tuned model using the Supervised Fine-Tuning (SFT) technique on English and Italian language datasets.
A Dynamic Preference Optimization (DPO) process has been used to align preferences, avoid dangerous and inappropriate answers, and limit biases and prejudices.
arXiv Detail & Related papers (2024-05-11T22:02:55Z)
- Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
- Mini Minds: Exploring Bebeshka and Zlata Baby Models [3.558894829990311]
We describe the University of Lyon 2 submission to the Strict-Small track of the BabyLM competition.
We introduce two small-size language models (LMs) that were submitted for evaluation.
Despite being half the scale of the baseline LMs, our proposed models achieve comparable performance.
arXiv Detail & Related papers (2023-11-06T16:01:10Z)
- Binary and Ternary Natural Language Generation [24.295815261826153]
Ternary and binary neural networks enable multiplication-free computation.
They promise multiple orders of magnitude efficiency gains over full-precision networks.
However, such networks have proven very difficult to optimize.
We present the first ternary and binary transformer models on the downstream tasks of summarization and machine translation.
arXiv Detail & Related papers (2023-06-02T18:01:02Z)
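To illustrate the multiplication-free computation mentioned in the entry above, the sketch below ternarizes a weight matrix to {-1, 0, +1} with a single per-tensor scale, so the matrix-vector product reduces to additions and subtractions. The threshold heuristic and scaling rule are common choices from the ternary-network literature, not necessarily that paper's scheme.

```python
# Illustrative ternary quantization (not the paper's exact scheme): weights are
# mapped to {-1, 0, +1} times one scale, so the matmul needs no multiplications.
import torch

def ternarize(w: torch.Tensor, threshold_ratio: float = 0.7):
    """Ternarize a weight tensor with a magnitude threshold and a single scale."""
    delta = threshold_ratio * w.abs().mean()           # threshold heuristic
    q = torch.zeros_like(w)
    q[w > delta] = 1.0
    q[w < -delta] = -1.0
    # Scale chosen as the mean magnitude of the surviving (non-zero) entries.
    mask = q != 0
    alpha = w[mask].abs().mean() if mask.any() else w.new_tensor(0.0)
    return q, alpha

w = torch.randn(256, 512)
x = torch.randn(512)
q, alpha = ternarize(w)
y_approx = alpha * (q @ x)        # additions/subtractions only, then one scale
y_exact = w @ x
print(torch.nn.functional.cosine_similarity(y_approx, y_exact, dim=0))
```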
- BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting [50.24676567971536]
The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages.
We apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages.
We conclude that, with sufficient training data, language adaptation can generalize well to diverse languages.
arXiv Detail & Related papers (2022-12-19T15:24:45Z)
- Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES [10.785577504399077]
We propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively.
We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function.
We demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations.
arXiv Detail & Related papers (2022-05-02T07:51:37Z)
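To make the "softmax replaced by a multi-label classification layer" idea in the entry above concrete, the sketch below scores every vocabulary item with an independent sigmoid and trains with binary cross-entropy, treating the single reference token as the positive class. This is only a generic single-positive multi-label setup for illustration; the actual SCONES loss has its own formulation.

```python
# Illustrative multi-label output layer for NMT (generic sketch, not the exact
# SCONES loss): each vocabulary item gets an independent sigmoid score instead
# of competing in a softmax, so several tokens can be marked adequate at once.
import torch
import torch.nn as nn

vocab_size, hidden = 32000, 512
output_layer = nn.Linear(hidden, vocab_size)        # takes the softmax projection's place
bce = nn.BCEWithLogitsLoss()

decoder_states = torch.randn(8, hidden)             # batch of target positions
reference_ids = torch.randint(0, vocab_size, (8,))  # single reference token per position

logits = output_layer(decoder_states)               # (8, vocab_size), no softmax applied
targets = torch.zeros_like(logits)
targets[torch.arange(8), reference_ids] = 1.0       # one positive label per position
loss = bce(logits, targets)
loss.backward()
```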
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We show that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
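One minimal way to picture the gradient-disentangled embedding sharing mentioned above: the discriminator reuses the generator's token embeddings through a stop-gradient plus its own small residual table, so discriminator updates cannot pull the shared embeddings in a conflicting direction. This is a simplified reading of the mechanism, not DeBERTaV3's implementation.

```python
# Simplified sketch of gradient-disentangled embedding sharing (not the
# DeBERTaV3 code): the discriminator sees the shared table through detach()
# plus its own residual table, so only the generator loss updates the shared table.
import torch
import torch.nn as nn

class SharedEmbeddings(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.shared = nn.Embedding(vocab_size, dim)   # trained by the generator only
        self.delta = nn.Embedding(vocab_size, dim)    # discriminator-specific residual
        nn.init.zeros_(self.delta.weight)

    def generator_embed(self, ids: torch.Tensor) -> torch.Tensor:
        return self.shared(ids)                        # gradients flow into `shared`

    def discriminator_embed(self, ids: torch.Tensor) -> torch.Tensor:
        # stop-gradient on the shared table avoids the tug-of-war between losses
        return self.shared(ids).detach() + self.delta(ids)

emb = SharedEmbeddings(vocab_size=30522, dim=128)
ids = torch.randint(0, 30522, (4, 16))
g = emb.generator_embed(ids)       # (4, 16, 128), updates the shared weights
d = emb.discriminator_embed(ids)   # (4, 16, 128), updates only the residual table
```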
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
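The block-wise quantization behind those 8-bit optimizer states can be sketched as follows: the state tensor is split into fixed-size blocks, each block is normalized by its own absolute maximum, and the normalized values are rounded to 8-bit codes. This is a simplified linear-quantization stand-in for illustration; the paper itself uses dynamic quantization codebooks and is implemented in the bitsandbytes library.

```python
# Simplified block-wise 8-bit quantization of an optimizer state tensor
# (illustrative only; the actual method uses dynamic codebooks, see bitsandbytes).
import torch

def blockwise_quantize(state: torch.Tensor, block_size: int = 256):
    flat = state.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])              # pad to whole blocks
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().max(dim=1, keepdim=True).values.clamp_min(1e-8)
    codes = torch.round(blocks / scales * 127).to(torch.int8)  # one scale per block
    return codes, scales, state.shape, pad

def blockwise_dequantize(codes, scales, shape, pad):
    flat = (codes.to(torch.float32) / 127 * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

m = torch.randn(1000, 300)                                     # e.g., Adam's first-moment state
codes, scales, shape, pad = blockwise_quantize(m)
m_hat = blockwise_dequantize(codes, scales, shape, pad)
print((m - m_hat).abs().max())                                 # small block-wise rounding error
```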
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
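The entry above describes obtaining cross-lingual embeddings from an LSTM encoder-decoder that both translates and reconstructs its input; a bare-bones version of that dual-decoder setup might look like the sketch below. The layer sizes and the equal weighting of the two losses are assumptions for illustration, not the paper's configuration.

```python
# Bare-bones dual-objective LSTM sketch (illustrative; not the paper's model):
# one encoder feeds two decoders, one reconstructing the source sentence and
# one producing the translation; the embedding tables are the learned representations.
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)   # these embeddings become the
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)   # cross-lingual word representations
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.rec_decoder = nn.LSTM(dim, dim, batch_first=True)
        self.trans_decoder = nn.LSTM(dim, dim, batch_first=True)
        self.rec_out = nn.Linear(dim, src_vocab)
        self.trans_out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_in_ids):
        _, (h, c) = self.encoder(self.src_emb(src_ids))                  # encode source
        rec, _ = self.rec_decoder(self.src_emb(src_ids), (h, c))          # reconstruct
        trans, _ = self.trans_decoder(self.tgt_emb(tgt_in_ids), (h, c))   # translate
        return self.rec_out(rec), self.trans_out(trans)

model = TranslateAndReconstruct(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (4, 12))
tgt = torch.randint(0, 8000, (4, 10))
rec_logits, trans_logits = model(src, tgt)
ce = nn.CrossEntropyLoss()
loss = ce(rec_logits.transpose(1, 2), src) + ce(trans_logits.transpose(1, 2), tgt)
loss.backward()
```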