LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
- URL: http://arxiv.org/abs/2503.02502v1
- Date: Tue, 04 Mar 2025 11:10:13 GMT
- Title: LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
- Authors: Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang
- Abstract summary: Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). We propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM). LADM can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus.
- Score: 8.34562564266839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). Continual training on long-context data has become the de facto method for equipping LLMs with the ability to process long inputs. However, it remains an open challenge to measure the quality of long-context training data. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
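The abstract describes the scoring idea only at a high level. Below is a minimal, hypothetical sketch of attention-based dependency scoring for selecting long-context documents; the scoring model (gpt2), the local-window threshold, the layer/head averaging, and the selection ratio are all illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: score documents by how much attention mass tokens place
# on distant context, then keep the highest-scoring documents. All constants
# and names below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # stand-in scoring model (assumption)
LOCAL_WINDOW = 256       # tokens closer than this count as "local" (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def dependency_score(text: str, max_len: int = 1024) -> float:
    """Average fraction of attention mass each token places on distant context,
    aggregated over layers, heads, and positions."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len)
    out = model(**ids, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # (seq, seq), layer/head mean
    seq = attn.size(0)
    pos = torch.arange(seq)
    distant = (pos.unsqueeze(0) - pos.unsqueeze(1)).abs() > LOCAL_WINDOW  # (seq, seq) mask
    # causal attention already zeroes the upper triangle, so a plain sum is safe
    return (attn * distant).sum(dim=-1).mean().item()

# Rank a candidate corpus and keep the documents with the strongest
# long-range dependencies (the 20% selection ratio is an assumption).
docs = ["..."]  # candidate long documents
ranked = sorted(docs, key=dependency_score, reverse=True)
selected = ranked[: int(0.2 * len(ranked))]
```

Averaging attention over all layers and heads is the simplest possible aggregation; a faithful reproduction would follow whatever head selection and scoring rule the paper actually specifies.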
Related papers
- LongAttn: Selecting Long-context Training Data via Token-level Attention [16.30530770590871]
LongAttn is a token-level framework that measures long-range dependencies in the data. With it, we filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code).
arXiv Detail & Related papers (2025-02-24T05:51:53Z)
- Reducing Distraction in Long-Context Language Models by Focused Learning [6.803882766744194]
We propose a novel training method that enhances Large Language Models' ability to discern relevant information.
During fine-tuning with long contexts, we employ a retriever to extract the most relevant segments.
We then introduce an auxiliary contrastive learning objective to explicitly ensure that outputs from the original context and the retrieved sub-context are closely aligned (a toy sketch of such an objective follows this entry).
arXiv Detail & Related papers (2024-11-08T19:27:42Z)
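For the focused-learning entry above, here is a toy, hypothetical sketch of an auxiliary alignment objective between full-context and retrieved sub-context outputs. The paper describes a contrastive objective; this sketch substitutes a simpler KL-based consistency term as a stand-in, and the loss form, weighting, and function names are assumptions.

```python
# Hypothetical stand-in for an auxiliary alignment objective: pull the model's
# output distribution under the full long context toward its output under a
# retrieved sub-context. Not the paper's exact formulation.
import torch
import torch.nn.functional as F

def focused_alignment_loss(full_logits, sub_logits, labels, alpha=0.5):
    """full_logits / sub_logits: (batch, seq, vocab) logits at answer positions from
    the same model run on the full context and on the retrieved sub-context;
    labels: (batch, seq) gold answer tokens; alpha: weighting (assumption)."""
    # Standard next-token loss on the full-context run
    lm_loss = F.cross_entropy(full_logits.reshape(-1, full_logits.size(-1)),
                              labels.reshape(-1))
    # Alignment term: KL(sub-context distribution || full-context distribution),
    # encouraging the long-context output to match the focused sub-context output
    align = F.kl_div(F.log_softmax(full_logits, dim=-1),
                     F.log_softmax(sub_logits, dim=-1),
                     log_target=True, reduction="batchmean")
    return lm_loss + alpha * align

# Toy usage with random tensors (shapes only; no real model involved)
B, T, V = 2, 8, 100
full = torch.randn(B, T, V, requires_grad=True)
sub = torch.randn(B, T, V)
labels = torch.randint(0, V, (B, T))
loss = focused_alignment_loss(full, sub, labels)
loss.backward()
```

In practice the two logit tensors would come from two forward passes of the same model, one on the full long context and one on the retrieved segments, restricted to the answer positions.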
- Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models [62.698520962933195]
Large Vision-Language Models (LVLMs) excel in cross-modal tasks but experience performance declines in long-context reasoning.
We propose a novel training-free context pruning method that selectively removes less critical textual information.
arXiv Detail & Related papers (2024-10-25T17:59:09Z)
- GATEAU: Selecting Influential Samples for Long Context Alignment [62.87020831987625]
GATEAU identifies influential samples enriched with long-range dependency relations. Experiments indicate that GATEAU effectively identifies such samples, and the model trained on them exhibits better instruction-following and long-context understanding capabilities.
arXiv Detail & Related papers (2024-10-21T04:30:53Z)
- A Controlled Study on Long Context Extension and Generalization in LLMs [85.4758128256142]
Broad textual understanding and in-context learning require language models that utilize full document contexts.
Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts.
We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data.
arXiv Detail & Related papers (2024-09-18T17:53:17Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications.
We propose a data mining framework, ProLong, that assigns each training sample a long-dependency score.
Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
- Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable to strong baselines such as GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
- LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset of over 160B tokens specifically tailored to enhance the training of language models for long-text tasks.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)