An Empirical Study of Translation Hypothesis Ensembling with Large Language Models
- URL: http://arxiv.org/abs/2310.11430v1
- Date: Tue, 17 Oct 2023 17:40:21 GMT
- Title: An Empirical Study of Translation Hypothesis Ensembling with Large Language Models
- Authors: António Farinhas, José G. C. de Souza, André F. T. Martins
- Abstract summary: Large language models (LLMs) are becoming a one-fits-many solution, but they sometimes hallucinate or produce unreliable output.
We investigate how hypothesis ensembling can improve the quality of the generated text.
- Score: 9.068791020917217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are becoming a one-fits-many solution, but they
sometimes hallucinate or produce unreliable output. In this paper, we
investigate how hypothesis ensembling can improve the quality of the generated
text for the specific problem of LLM-based machine translation. We experiment
with several techniques for ensembling hypotheses produced by LLMs such as
ChatGPT, LLaMA, and Alpaca. We provide a comprehensive study along multiple
dimensions, including the method to generate hypotheses (multiple prompts,
temperature-based sampling, and beam search) and the strategy to produce the
final translation (instruction-based, quality-based reranking, and minimum
Bayes risk (MBR) decoding). Our results show that MBR decoding is a very
effective method, that translation quality can be improved using a small number
of samples, and that instruction tuning has a strong impact on the relation
between the diversity of the hypotheses and the sampling temperature.
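Since MBR decoding is the headline result, a minimal sketch of the idea may be useful: sample several hypotheses, score each by its average utility against all the others, and return the consensus translation. The token-overlap utility below is a self-contained stand-in for the neural metrics (e.g., COMET) typically used in practice; this is an illustration, not the paper's implementation.

```python
# Minimal sketch of minimum Bayes risk (MBR) decoding over sampled
# translation hypotheses.
from collections import Counter

def token_f1(hyp: str, ref: str) -> float:
    """Unigram F1 overlap; a stand-in for a learned utility such as COMET."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(hypotheses: list[str]) -> str:
    """Return the hypothesis with the highest average utility against
    all sampled hypotheses (its estimated expected utility)."""
    def expected_utility(h: str) -> float:
        return sum(token_f1(h, other) for other in hypotheses) / len(hypotheses)
    return max(hypotheses, key=expected_utility)

# Hypotheses as they might come from multiple prompts or sampled decodes.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a feline rested on the rug",
]
print(mbr_decode(samples))
```

Because the utility is evaluated over all hypothesis pairs, the cost grows quadratically with the sample count, which makes the finding that a small number of samples already helps practically important.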
Related papers
- Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step.
Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
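As background on the family of methods being compared, here is a minimal top-p (nucleus) truncation step; the parameter p is exactly the kind of knob whose diversity/risk trade-off the paper analyzes. This is generic context, not the paper's estimator.

```python
# One common truncation sampling method: top-p (nucleus) filtering.
import numpy as np

def top_p_filter(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]           # tokens by descending probability
    mass_before = np.cumsum(probs[order]) - probs[order]
    keep = mass_before < p                    # keep until mass p is covered
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = probs[order[keep]]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(probs, p=0.8))  # the low-probability tail is pruned
```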
arXiv Detail & Related papers (2024-08-24T14:14:32Z)
- Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs [4.122612309805664]
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step.
We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by scaling according to the top token's probability.
We conduct extensive experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing, demonstrating that min-p sampling improves both the quality and diversity of generated text, particularly at high temperatures.
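A minimal sketch of that rule (illustrative, not the authors' implementation): the truncation threshold scales with the top token's probability, so confident steps keep few tokens while uncertain steps keep many.

```python
# Min-p truncation: threshold the distribution relative to its peak.
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    """Keep tokens whose probability is at least min_p times the top
    token's probability, then renormalize."""
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# A sharp (confident) distribution keeps only the top token ...
print(min_p_filter(np.array([0.90, 0.05, 0.03, 0.02])))
# ... while a flat (uncertain) one keeps all of them, preserving diversity.
print(min_p_filter(np.array([0.30, 0.28, 0.22, 0.20])))
```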
arXiv Detail & Related papers (2024-07-01T08:37:25Z)
- QUEST: Quality-Aware Metropolis-Hastings Sampling for Machine Translation [25.165239478219267]
We propose a simple and effective way to avoid over-reliance on noisy quality estimates by using them as the energy function of a Gibbs distribution.
Instead of looking for a mode in the distribution, we generate multiple samples from high-density areas through the Metropolis-Hastings algorithm.
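A hedged sketch of that scheme: the quality estimate acts as the energy of a Gibbs distribution, and a Metropolis-Hastings chain with LLM-generated proposals draws translations from its high-density regions. `generate_candidate` and `quality_score` are placeholders; QUEST's actual proposal regenerates a suffix of the current hypothesis, and a full sampler would also correct for the proposal probabilities.

```python
# Simplified Metropolis-Hastings over translation hypotheses,
# targeting a Gibbs distribution proportional to exp(quality / temperature).
import math
import random

def metropolis_hastings(generate_candidate, quality_score,
                        steps: int = 100, temperature: float = 1.0):
    current = generate_candidate()
    current_q = quality_score(current)
    samples = []
    for _ in range(steps):
        proposal = generate_candidate()
        proposal_q = quality_score(proposal)
        # Accept with probability min(1, exp((q' - q) / temperature)).
        if random.random() < math.exp(min(0.0, (proposal_q - current_q) / temperature)):
            current, current_q = proposal, proposal_q
        samples.append(current)
    return samples

# Toy placeholders for the LLM proposal and the QE metric (illustrative only).
pool = ["la casa azul", "la casa es azul", "azul la casa"]
draws = metropolis_hastings(
    generate_candidate=lambda: random.choice(pool),
    quality_score=lambda y: -abs(len(y.split()) - 4),  # fake quality estimate
)
```

Sampling around the mode rather than at it is what makes the method robust to noise in the quality estimates.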
arXiv Detail & Related papers (2024-05-28T17:36:06Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experimental results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - A Preference-driven Paradigm for Enhanced Translation with Large Language Models [33.51585908894444]
Large language models (LLMs) can achieve remarkable translation performance using only a small amount of parallel data.
Supervised fine-tuning (SFT) simply instructs the model to imitate the reference translations at the token level, making it vulnerable to noise in the references.
We propose a preference-based approach built upon the Plackett-Luce model to overcome the performance plateau this causes.
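A sketch of what such a preference objective can look like: a Plackett-Luce negative log-likelihood over a quality-based ranking of candidate translations, so the model learns to prefer better candidates rather than to imitate one reference token by token. All names here are illustrative, not the paper's code.

```python
# Plackett-Luce ranking loss over candidate translation scores.
import torch

def plackett_luce_nll(scores: torch.Tensor, ranking: list[int]) -> torch.Tensor:
    """Negative log-likelihood of `ranking` (candidate indices, best
    first) under the Plackett-Luce model with the given scores."""
    ordered = scores[torch.tensor(ranking)]
    nll = torch.tensor(0.0)
    for k in range(len(ranking) - 1):
        # P(pick candidate k next) = exp(s_k) / sum over the remaining ones.
        nll = nll - (ordered[k] - torch.logsumexp(ordered[k:], dim=0))
    return nll

# Model scores for three candidate translations, with a quality metric
# ranking them as: candidate 2 > candidate 0 > candidate 1.
scores = torch.tensor([1.2, -0.3, 2.0], requires_grad=True)
loss = plackett_luce_nll(scores, ranking=[2, 0, 1])
loss.backward()  # gradients nudge the model toward the preferred ordering
```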
arXiv Detail & Related papers (2024-04-17T11:52:47Z)
- Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning [57.323716555996114]
Off-target translation remains an unsolved problem, especially for low-resource languages.
Recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs.
In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs.
arXiv Detail & Related papers (2024-03-21T13:47:40Z)
- Adapting Large Language Models for Document-Level Machine Translation [46.370862171452444]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks.
Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning.
This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs.
arXiv Detail & Related papers (2024-01-12T09:29:13Z)
- On the Calibration of Multilingual Question Answering LLMs [57.296161186129545]
We benchmark the calibration of several multilingual Large Language Models (MLLMs) on a variety of Question Answering tasks.
We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings.
For decoder-only LLMs such as LLaMA2, we additionally find that in-context learning improves confidence calibration on multilingual data.
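For concreteness, a standard way to quantify calibration in such studies is the expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence to its accuracy. The sketch below is illustrative, not necessarily the paper's exact protocol.

```python
# Minimal expected calibration error (ECE) computation.
import numpy as np

def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight the gap by bin population
    return ece

conf = np.array([0.9, 0.8, 0.7, 0.6, 0.95])
hit = np.array([1, 1, 0, 1, 0])        # whether each answer was correct
print(expected_calibration_error(conf, hit))
```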
arXiv Detail & Related papers (2023-11-15T03:29:02Z)
- Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
- Pareto Optimal Learning for Estimating Large Language Model Errors [12.21899680905672]
Large Language Models (LLMs) have shown impressive abilities in many applications.
We present a method that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information.
arXiv Detail & Related papers (2023-06-28T21:11:15Z)
- Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize.
We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model.
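A hedged sketch of the mechanism: each sample is identified by a code point in [0, 1) that is decoded by repeatedly locating it inside the model's cumulative next-token distribution, arithmetic-coding style. Evenly spaced code points yield diverse sequences, and each decode is independent, hence trivially parallel. `next_token_probs` below is a toy placeholder for the LLM.

```python
# Arithmetic-coding-style decoding from a code point in [0, 1).
import numpy as np

def decode_from_code_point(next_token_probs, u: float,
                           max_len: int = 20, eos: int = 0) -> list[int]:
    """Turn one code point u in [0, 1) into one token sequence."""
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        cdf = np.cumsum(probs)
        tok = int(np.searchsorted(cdf, u, side="right"))
        # Rescale u into the chosen token's interval for the next step.
        low = cdf[tok - 1] if tok > 0 else 0.0
        u = (u - low) / probs[tok]
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

def toy_model(prefix):
    """Fixed toy next-token distribution over a 3-token vocabulary."""
    return np.array([0.1, 0.6, 0.3])

# N evenly spaced code points -> N diverse samples, decodable in parallel.
codes = (np.arange(4) + 0.5) / 4
samples = [decode_from_code_point(toy_model, u) for u in codes]
```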
arXiv Detail & Related papers (2022-10-18T22:19:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.