In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
- URL: http://arxiv.org/abs/2402.10790v2
- Date: Wed, 21 Feb 2024 03:07:42 GMT
- Title: In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
- Authors: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
- Abstract summary: We introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts.
Fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to $11\times 10^6$ elements.
This achievement marks a substantial leap, as it is by far the longest input processed by any neural network model to date.
- Score: 4.8384738694883955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the challenge of processing long documents using
generative transformer models. To evaluate different approaches, we introduce
BABILong, a new benchmark designed to assess model capabilities in extracting
and processing distributed facts within extensive texts. Our evaluation, which
includes benchmarks for GPT-4 and RAG, reveals that common methods are
effective only for sequences up to $10^4$ elements. In contrast, fine-tuning
GPT-2 with recurrent memory augmentations enables it to handle tasks involving
up to $11\times 10^6$ elements. This achievement marks a substantial leap, as
it is by far the longest input processed by any neural network model to date,
demonstrating a significant improvement in the processing capabilities for long
sequences.
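The abstract describes the recurrent-memory approach only at a high level. As a rough illustration of the general idea (in the spirit of the Recurrent Memory Transformer), the sketch below splits a long token stream into fixed-size segments and carries a small memory state from one segment to the next; `SegmentModel`, `MEM_SIZE`, and `SEGMENT_LEN` are illustrative placeholders, not the authors' implementation or settings.

```python
from typing import List, Tuple

# Illustrative constants; the actual segment and memory sizes are assumptions,
# not taken from the paper.
MEM_SIZE = 16       # number of memory slots carried between segments (assumed)
SEGMENT_LEN = 512   # tokens processed per recurrent step (assumed)

class SegmentModel:
    """Hypothetical stand-in for a transformer (e.g., GPT-2) augmented with
    memory tokens; a real model would attend over [memory ; segment]."""
    def step(self, memory: List[float], segment: List[int]) -> Tuple[List[float], str]:
        updated_memory = memory  # identity update keeps the sketch runnable
        answer = ""              # the task answer is read out after the last segment
        return updated_memory, answer

def process_long_input(tokens: List[int], model: SegmentModel) -> str:
    """Run the model over an arbitrarily long input one segment at a time,
    carrying a fixed-size memory state between segments."""
    memory = [0.0] * MEM_SIZE
    answer = ""
    for start in range(0, len(tokens), SEGMENT_LEN):
        segment = tokens[start:start + SEGMENT_LEN]
        memory, answer = model.step(memory, segment)  # recurrence over segments
    return answer

if __name__ == "__main__":
    # Per-step compute and memory stay bounded by SEGMENT_LEN, which is what
    # makes inputs on the order of millions of tokens feasible in principle.
    tokens = list(range(1_000_000))
    print(repr(process_long_input(tokens, SegmentModel())))
```

Because each step sees only the current segment plus the memory state, per-step cost is independent of total input length; everything the model needs from earlier text must survive in the memory, which is exactly the ability BABILong's distributed-fact tasks probe.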
Related papers
- Multi-stage Large Language Model Pipelines Can Outperform GPT-4o in Relevance Assessment [6.947361774195549]
We propose a modular classification pipeline that divides the relevance assessment task into multiple stages.
One of our approaches showed an 18.4% increase in Krippendorff's $\alpha$ over OpenAI's GPT-4o mini.
arXiv Detail & Related papers (2025-01-24T07:33:39Z) - DocMamba: Efficient Document Pre-training with State Space Model [56.84200017560988]
We present DocMamba, a novel framework based on the state space model.
It is designed to reduce computational complexity to linear while preserving global modeling capabilities.
Experiments on the HRDoc dataset confirm DocMamba's potential for length extrapolation.
arXiv Detail & Related papers (2024-09-18T11:34:28Z) - Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x (a minimal sketch of the chunked-computation idea appears after this list).
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves nearly matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models [58.54538318912159]
M4LE is a benchmark for evaluating the long-sequence capability of large language models (LLMs).
M4LE is based on a diverse NLP task pool comprising 36 NLP task types and 12 domains.
We conducted a systematic evaluation on 11 well-established LLMs, especially those optimized for long-sequence inputs.
arXiv Detail & Related papers (2023-10-30T03:11:30Z) - Abstractive Summarization as Augmentation for Document-Level Event Detection [0.0]
We bridge the performance gap between shallow and deep models on document-level event detection by using abstractive text summarization as an augmentation method.
We use four decoding methods for text generation, namely beam search, top-k sampling, top-p sampling, and contrastive search.
Our results show that using the document title offers 2.04% and 3.19% absolute improvements in macro F1-score for linear SVM and RoBERTa, respectively.
arXiv Detail & Related papers (2023-05-29T11:28:26Z) - How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to a corpus of 8.8M passages and evaluate model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z) - MuLD: The Multitask Long Document Benchmark [4.835289158553091]
We present a new long-document benchmark consisting only of documents over 10,000 tokens.
We show that models with increased context length are better able to solve the tasks presented.
arXiv Detail & Related papers (2022-02-15T12:42:55Z) - SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
The SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved with more than $9$% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
arXiv Detail & Related papers (2021-09-21T10:43:42Z) - Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
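For the Mini-Sequence Transformer entry above, here is a minimal sketch (not the authors' code) of the chunked-computation idea: the output projection and loss for a long sequence are computed chunk by chunk along the sequence dimension, so the full [seq_len, vocab_size] logits tensor is never materialized at once. The function name, shapes, and chunk length are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss(hidden, lm_head_weight, targets, chunk_len=1024):
    """Cross-entropy over a long sequence, computed mini-sequence by mini-sequence.

    hidden:         [seq_len, d_model] final hidden states
    lm_head_weight: [vocab_size, d_model] output projection
    targets:        [seq_len] next-token ids
    """
    total_loss, seq_len = 0.0, hidden.size(0)
    for start in range(0, seq_len, chunk_len):
        h = hidden[start:start + chunk_len]   # one mini-sequence of hidden states
        logits = h @ lm_head_weight.T         # only [chunk_len, vocab_size] lives in memory
        total_loss = total_loss + F.cross_entropy(
            logits, targets[start:start + chunk_len], reduction="sum"
        )
    return total_loss / seq_len               # same value as the unchunked mean loss

# Toy usage: peak activation memory scales with chunk_len, not seq_len.
hidden = torch.randn(8192, 64)
lm_head_weight = torch.randn(50_000, 64)
targets = torch.randint(0, 50_000, (8192,))
print(chunked_lm_loss(hidden, lm_head_weight, targets))
```

The same partitioning can be applied to other activation-heavy blocks such as the MLP; the entry above reports that this kind of mini-sequence processing extends the maximum trainable context length of Qwen, Mistral, and Gemma-2 by 12-24x.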