Lissard: Long and Simple Sequential Reasoning Datasets
- URL: http://arxiv.org/abs/2402.07859v2
- Date: Tue, 20 Feb 2024 15:12:13 GMT
- Title: Lissard: Long and Simple Sequential Reasoning Datasets
- Authors: Mirelle Bueno, Roberto Lotufo, and Rodrigo Nogueira
- Abstract summary: Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens.
However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training.
We introduce Lissard, a benchmark comprising seven tasks whose goal is to assess the ability of models to process and generate sequences across a wide range of lengths.
- Score: 10.39816548971042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models are now capable of solving tasks that require dealing with
long sequences consisting of hundreds of thousands of tokens. However, they
often fail on tasks that require repetitive use of simple rules, even on
sequences that are much shorter than those seen during training. For example,
state-of-the-art LLMs can find common items in two lists with up to 20 items
but fail when lists have 80 items. In this paper, we introduce Lissard, a
benchmark comprising seven tasks whose goal is to assess the ability of models
to process and generate sequences across a wide range of lengths, requiring repetitive
procedural execution. Our evaluation of open-source (Mistral-7B and
Mixtral-8x7B) and proprietary models (GPT-3.5 and GPT-4) shows a consistent
decline in performance across all models as the complexity of the sequence
increases. The datasets and code are available at
https://github.com/unicamp-dl/Lissard
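As a rough illustration of the kind of length-generalization probe the abstract describes, the sketch below generates "find the common items in two lists" instances at several list lengths and scores exact set matches. The task wording, lengths, and scoring are illustrative assumptions, not Lissard's actual format; the repository above has the real tasks.
```python
import random

def make_instance(n_items, n_common, rng):
    """Build one 'find the common items in two lists' instance.

    Returns (prompt, gold) where gold is the set of shared items. The task
    wording is an illustrative assumption, not Lissard's exact format.
    """
    pool = rng.sample(range(100_000), 2 * n_items - n_common)
    common, only_a, only_b = pool[:n_common], pool[n_common:n_items], pool[n_items:]
    list_a, list_b = common + only_a, common + only_b
    rng.shuffle(list_a)
    rng.shuffle(list_b)
    prompt = (
        "List A: " + ", ".join(map(str, list_a)) + "\n"
        "List B: " + ", ".join(map(str, list_b)) + "\n"
        "Which items appear in both lists?"
    )
    return prompt, set(common)

def evaluate(model_fn, lengths=(20, 40, 80), n_trials=50):
    """Exact-set-match accuracy per list length.

    model_fn is any callable mapping a prompt string to a set of ints,
    e.g. a wrapper that calls an LLM API and parses the numbers it returns.
    """
    rng = random.Random(0)
    accuracy = {}
    for n in lengths:
        hits = sum(
            model_fn(prompt) == gold
            for prompt, gold in (make_instance(n, n // 4, rng) for _ in range(n_trials))
        )
        accuracy[n] = hits / n_trials
    return accuracy
```
A sharp drop in accuracy between 20-item and 80-item lists under a fixed prompt format is exactly the failure mode the abstract describes.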
Related papers
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation [74.89981179257194]
LongProc (Long Procedural Generation) is a new benchmark for evaluating long-context language models (LCLMs).
LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans.
We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks.
arXiv Detail & Related papers (2025-01-09T18:16:55Z)
- Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making.
Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance.
We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z)
- MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks [10.39816548971042]
Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens.
However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training.
We introduce MLissard, a benchmark designed to evaluate models' abilities to process and generate texts of varied lengths.
arXiv Detail & Related papers (2024-10-08T21:59:31Z)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- Graph-enhanced Large Language Models in Asynchronous Plan Reasoning [18.402877904882107]
On our benchmark AsyncHow, we find that large language models (LLMs) perform poorly when not supplied with illustrations of the task-solving process.
We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results (a toy illustration of the idea follows after this list).
arXiv Detail & Related papers (2024-02-05T08:26:33Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- AskIt: Unified Programming Interface for Programming with Large Language Models [0.0]
Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks.
This paper introduces AskIt, a domain-specific language specifically designed for LLMs.
Across 50 tasks, AskIt generated concise prompts, achieving a 16.14% reduction in prompt length compared to benchmarks.
arXiv Detail & Related papers (2023-08-29T21:44:27Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)
- Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z)
- HiPool: Modeling Long Documents Using Graph Neural Networks [24.91040673099863]
Long sequences are a challenging problem in Natural Language Processing (NLP).
Recent pretrained language models achieve satisfying performance in many NLP tasks.
We propose a new challenging benchmark, totaling six datasets with up to 53k samples and an average length of 4,034 tokens.
arXiv Detail & Related papers (2023-05-05T06:58:24Z)
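The PLaG entry above compresses the method into one sentence; as a toy illustration of combining a graph with a natural-language prompt, the sketch below serializes a small task dependency graph into prompt text and computes the gold answer via the critical path. The task, graph encoding, and wording are illustrative assumptions, not the paper's exact method.
```python
# Hypothetical asynchronous-planning instance: steps with durations and
# prerequisites. Serializing the dependency graph into the prompt captures
# the PLaG idea in spirit; the encoding and wording are assumptions.
steps = {
    "boil water":  {"minutes": 10, "after": []},
    "chop onions": {"minutes": 5,  "after": []},
    "cook pasta":  {"minutes": 12, "after": ["boil water"]},
    "make sauce":  {"minutes": 8,  "after": ["chop onions"]},
    "combine":     {"minutes": 2,  "after": ["cook pasta", "make sauce"]},
}

def graph_as_text(steps):
    """Render the dependency graph as an explicit edge list for the prompt."""
    return "\n".join(
        f"- {name} ({info['minutes']} min) must wait for: "
        + (", ".join(info["after"]) or "nothing")
        for name, info in steps.items()
    )

def min_total_time(steps):
    """Gold answer: length of the critical (longest) path through the DAG."""
    memo = {}
    def finish(name):
        if name not in memo:
            memo[name] = steps[name]["minutes"] + max(
                (finish(dep) for dep in steps[name]["after"]), default=0
            )
        return memo[name]
    return max(finish(name) for name in steps)

prompt = (
    "Steps can run in parallel unless one must wait for another.\n"
    "Dependency graph:\n" + graph_as_text(steps) + "\n"
    "What is the minimum total time to finish all steps?"
)
print(prompt)
print("Gold answer:", min_total_time(steps), "minutes")  # 24
```
Making the edge list explicit in the prompt, rather than burying dependencies in prose, is the design choice the paper's one-line summary points at.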