Related papers: MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks

MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks

URL: http://arxiv.org/abs/2410.06396v1
Date: Tue, 8 Oct 2024 21:59:31 GMT
Title: MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
Authors: Mirelle Bueno, Roberto Lotufo, Rodrigo Nogueira,
Abstract summary: Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. We introduce MLissard, a benchmark designed to evaluate models' abilities to process and generate texts of varied lengths.
Score: 10.39816548971042
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists have 80 items. In this paper, we introduce MLissard, a multilingual benchmark designed to evaluate models' abilities to process and generate texts of varied lengths and offers a mechanism for controlling sequence complexity. Our evaluation of open-source and proprietary models show a consistent decline in performance across all models and languages as the complexity of the sequence increases. Surprisingly, the use of in-context examples in languages other than English helps increase extrapolation performance significantly. The datasets and code are available at https://github.com/unicamp-dl/Lissard

Related papers

RELIC: Evaluating Compositional Instruction Following via Language Recognition [37.49115450182637]
Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context.<n>We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition.
arXiv Detail & Related papers (2025-06-05T16:17:24Z)
On Many-Shot In-Context Learning for Long-Context Evaluation [10.500629810624769]
This paper delves into long-context language model evaluation through many-shot ICL. We develop metrics to categorize ICL tasks into two groups: similar-sample learning (SSL) and all-sample learning (ASL) We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
arXiv Detail & Related papers (2024-11-11T17:00:59Z)
Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models. Our base generative subgraph retrieval model, consisting of only 220M parameters, competitive retrieval performance compared to state-of-the-art models. Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
Lissard: Long and Simple Sequential Reasoning Datasets [10.39816548971042]
Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. We introduce Lissard, a benchmark comprising seven tasks whose goal is to assess the ability of models to process and generate wide-range sequence lengths.
arXiv Detail & Related papers (2024-02-12T18:10:17Z)
Contextual Code Switching for Machine Translation using Language Models [1.4866655830571935]
Large language models (LLMs) have exerted a considerable impact on diverse language-related tasks in recent years. We present an extensive study on the code switching task specifically for the machine translation task comparing multiple LLMs. Our results indicate that despite the LLMs having promising results in the certain tasks, the models with relatively lesser complexity outperform the multilingual large language models in the machine translation task.
arXiv Detail & Related papers (2023-12-20T16:40:33Z)
The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency. We apply two languages to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes. It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding. It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese)
arXiv Detail & Related papers (2023-08-28T11:53:40Z)
Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text [5.532477732693001]
We show that a large language model can serve as a highly effective few-shot semantically. It can convert natural language sentences into a logical form that serves as input for answer set programs. We demonstrate that this method achieves state-of-the-art performance on several benchmarks, including bAbI, StepGame, CLUTRR, and gSCAN.
arXiv Detail & Related papers (2023-07-15T03:29:59Z)
xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval [32.60391966381949]
We introduce xCodeEval, the largest executable multilingual multitask benchmark to date. It features a total of $7$ tasks involving code understanding, generation, translation and retrieval. xCodeEval adopts an execution-based evaluation and offers a multilingual code execution engine, ExecEval.
arXiv Detail & Related papers (2023-03-06T10:08:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.