An Analysis of Multilingual FActScore
- URL: http://arxiv.org/abs/2406.19415v1
- Date: Thu, 20 Jun 2024 18:09:40 GMT
- Title: An Analysis of Multilingual FActScore
- Authors: Kim Trong Vu, Michael Krumdick, Varshini Reddy, Franck Dernoncourt, Viet Dac Lai
- Abstract summary: FActScore has gained popularity as a metric to estimate the factuality of long-form texts generated by Large Language Models (LLMs) in English.
This paper studies the limitations of each component in the four-component pipeline of FActScore in the multilingual setting.
- Score: 45.48784238480873
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: FActScore has gained popularity as a metric to estimate the factuality of long-form texts generated by Large Language Models (LLMs) in English. However, no prior work has studied its behavior in other languages. This paper studies the limitations of each component in the four-component pipeline of FActScore in the multilingual setting. We introduce a new dataset for FActScore on texts generated by strong multilingual LLMs. Our evaluation shows that LLMs exhibit distinct behaviors in both fact extraction and fact scoring tasks. No LLM produces consistent and reliable FActScore across languages with varying levels of resources. We also find that the knowledge source plays an important role in the quality of the estimated FActScore. Using Wikipedia as the knowledge source may understate the true FActScore of long-form text due to its limited coverage in medium- and low-resource languages. We also incorporate three mitigations into our knowledge source that ultimately improve FActScore estimation across all languages.
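For context, the metric itself reduces to a simple ratio: extract atomic facts from a generation, verify each one against a knowledge source, and report the supported fraction. Below is a minimal Python sketch of this computation; `extract_atomic_facts` and `is_supported` are hypothetical stand-ins for the pipeline's LLM-based extractor and knowledge-source verifier, not the authors' implementation.

```python
from typing import Callable, List

def factscore(
    generation: str,
    extract_atomic_facts: Callable[[str], List[str]],  # hypothetical LLM-based fact extractor
    is_supported: Callable[[str], bool],               # hypothetical verifier against a knowledge source
) -> float:
    """FActScore-style metric: the fraction of atomic facts in a
    generation that the knowledge source supports."""
    facts = extract_atomic_facts(generation)
    if not facts:
        return 0.0  # convention when no atomic facts can be extracted
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)
```

Framed this way, the coverage problem the abstract describes lives in `is_supported`: when the Wikipedia edition for a medium- or low-resource language lacks the relevant article, true facts get scored as unsupported and the estimate drops below the text's real factuality.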
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary of a source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
- MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment [48.03702722532143]
MEXA is a method for assessing the multilingual capabilities of English-centric large language models.
It computes the alignment between English and non-English languages using parallel sentences.
This alignment can be used to estimate model performance in other languages.
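As a rough illustration of this idea, the sketch below scores a language by the mean cosine similarity between embeddings of parallel English/non-English sentence pairs; the pooled sentence embeddings and the choice of mean cosine similarity are assumptions of this sketch, not necessarily MEXA's exact alignment measure.

```python
import numpy as np

def alignment_score(en_embs: np.ndarray, xx_embs: np.ndarray) -> float:
    """Mean cosine similarity over parallel sentence pairs.

    Row i of `en_embs` and `xx_embs` embeds the i-th English and
    non-English sentence of a parallel corpus, respectively.
    """
    en = en_embs / np.linalg.norm(en_embs, axis=1, keepdims=True)
    xx = xx_embs / np.linalg.norm(xx_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(en * xx, axis=1)))
```

Under the summary above, a higher alignment score would predict better downstream performance in that language.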
arXiv Detail & Related papers (2024-10-08T09:59:23Z)
- Quantifying Multilingual Performance of Large Language Models Across Languages [48.40607157158246]
Large Language Models (LLMs) perform better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate.
We propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations.
Our analysis reveals that high-resource languages exhibit higher similarity scores with English, in line with their stronger performance, while low-resource languages show lower similarity scores.
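A compact sketch of this ranking idea follows, assuming each language has been reduced to a single pooled hidden-state vector over some sample of text; that pooling choice is an assumption here, not necessarily the Language Ranker's exact formulation.

```python
import numpy as np

def rank_languages(reps: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Rank languages by cosine similarity of their pooled internal
    representation to English's, as a proxy for LLM performance.

    `reps` maps a language code (e.g., "de") to a pooled
    hidden-state vector; "en" serves as the reference.
    """
    en = reps["en"] / np.linalg.norm(reps["en"])
    scores = {
        lang: float(np.dot(vec / np.linalg.norm(vec), en))
        for lang, vec in reps.items()
        if lang != "en"
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```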
arXiv Detail & Related papers (2024-04-17T16:53:16Z)
- Multi-FAct: Assessing Factuality of Multilingual LLMs using FActScore [14.91669562846729]
We introduce a simple pipeline for multilingual factuality evaluation by applying FActScore to diverse languages.
We evaluate the factual accuracy of long-form text generation in topics that reflect regional diversity.
arXiv Detail & Related papers (2024-02-28T04:43:46Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance in most languages still lags behind that in a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations [63.90357081534995]
Long-form generations from large language models (LLMs) contain a mix of factual and non-factual claims.
We show that strong open-source models like Llama-chat can generate paragraphs that contain verifiable facts, but the facts are combined into a non-factual paragraph due to entity ambiguity.
We introduce an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities.
arXiv Detail & Related papers (2024-02-08T12:36:29Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
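A generic pointwise reranking loop consistent with this setup is sketched below; `llm_relevance` is a hypothetical placeholder for however the LLM is prompted to judge the relevance of an African-language passage to an English query.

```python
from typing import Callable, List, Tuple

def rerank(
    query_en: str,
    passages: List[str],                         # passages in the target African language
    llm_relevance: Callable[[str, str], float],  # hypothetical LLM-based relevance scorer
) -> List[Tuple[str, float]]:
    """Pointwise cross-lingual reranking: score each passage against
    the English query, then sort by descending relevance."""
    scored = [(p, llm_relevance(query_en, p)) for p in passages]
    return sorted(scored, key=lambda ps: ps[1], reverse=True)
```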
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- AfroBench: How Good are Large Language Models on African Languages? [55.35674466745322]
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages.
AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z)