End-to-End Long Document Summarization using Gradient Caching
- URL: http://arxiv.org/abs/2501.01805v1
- Date: Fri, 03 Jan 2025 13:32:57 GMT
- Title: End-to-End Long Document Summarization using Gradient Caching
- Authors: Rohit Saxena, Hao Tang, Frank Keller
- Abstract summary: Training transformer-based encoder-decoder models for long document summarization poses a significant challenge.
We propose CachED (Gradient $\textbf{Cach}$ing for $\textbf{E}$ncoder-$\textbf{D}$ecoder models), an approach that enables end-to-end training of existing transformer-based encoder-decoder models.
- Score: 16.52198368672941
- Abstract: Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at test time, but training with these approaches is still difficult, requiring truncation of input documents and causing a mismatch between training and test conditions. In this work, we propose CachED (Gradient $\textbf{Cach}$ing for $\textbf{E}$ncoder-$\textbf{D}$ecoder models), an approach that enables end-to-end training of existing transformer-based encoder-decoder models, using the entire document without truncation. Specifically, we apply non-overlapping sliding windows to input documents, followed by fusion in the decoder. During backpropagation, the gradients are cached at the decoder and are passed through the encoder in chunks by re-computing the hidden vectors, similar to gradient checkpointing. In the experiments on long document summarization, we extend BART to CachED BART, processing more than 500K tokens during training and achieving superior performance without using any additional parameters.
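The caching mechanism described in the abstract can be sketched in a few lines of PyTorch. The following is a minimal illustration under stated assumptions, not the authors' implementation: `ToyEncoder`, `ToyDecoder`, `cached_step`, and the window size are placeholders standing in for a BART-style model, and the decoder's mean-pooling "fusion" stands in for cross-attention over the concatenated chunk encodings.

```python
# Minimal sketch of gradient caching for an encoder-decoder model.
# ToyEncoder/ToyDecoder are illustrative stand-ins, not the paper's code.
import torch
import torch.nn as nn

WINDOW = 4  # non-overlapping window size (a real setup would use the model's context length)

class ToyEncoder(nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.ff = nn.Linear(hidden, hidden)
    def forward(self, ids):                       # ids: [B, T] -> hidden states [B, T, H]
        return torch.tanh(self.ff(self.emb(ids)))

class ToyDecoder(nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.proj = nn.Linear(hidden, vocab)
    def forward(self, enc_states, labels):        # pooled "fusion" over encoder states + loss
        pooled = enc_states.mean(dim=1, keepdim=True)
        logits = self.proj(pooled).expand(-1, labels.size(1), -1)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

def cached_step(encoder, decoder, input_ids, labels):
    chunks = input_ids.split(WINDOW, dim=1)       # 1. non-overlapping sliding windows
    with torch.no_grad():                         # 2. encode each chunk WITHOUT a graph,
        hidden = [encoder(c) for c in chunks]     #    so memory stays bounded by one window
    fused = torch.cat(hidden, dim=1).requires_grad_(True)  # 3. fuse chunk encodings for the decoder
    loss = decoder(fused, labels)
    loss.backward()                               # gradients stop at `fused`
    cached = fused.grad                           # 4. cache grads w.r.t. the encoder outputs
    offset = 0
    for c in chunks:                              # 5. recompute each chunk with a graph
        out = encoder(c)                          #    (as in gradient checkpointing) and
        out.backward(cached[:, offset:offset + out.size(1)])  # push the cached grad slice through
        offset += out.size(1)
    return loss.detach()

enc, dec = ToyEncoder(), ToyDecoder()
ids = torch.randint(0, 100, (2, 16))              # a "long" document spanning 4 windows
labels = torch.randint(0, 100, (2, 5))
print(cached_step(enc, dec, ids, labels))         # encoder and decoder grads are now populated
```

As with gradient checkpointing, peak training memory is bounded by roughly one window plus the decoder, at the cost of re-running the encoder forward pass per chunk during backpropagation; deterministic recomputation (or shared dropout state) is assumed so the replayed activations match the cached gradient.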
Related papers
- Explicit and data-Efficient Encoding via Gradient Flow [13.424502866278822]
We introduce a decoder-only method using gradient flow to directly encode data into the latent space.
We train the decoder via the adjoint method and show that costly integrals can be avoided with minimal accuracy loss.
This work paves the way for integrating machine learning into scientific applications, where precise and efficient encoding is critical.
arXiv Detail & Related papers (2024-12-01T15:54:50Z) - Equipping Transformer with Random-Access Reading for Long-Context Understanding [9.433800833564279]
Long-context modeling presents a significant challenge for transformer-based large language models.
We propose a novel reading strategy that enables transformers to efficiently process long documents without examining every token.
arXiv Detail & Related papers (2024-05-21T21:41:07Z) - Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval [26.00149743478937]
Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems.
We propose a modification to the traditional MAE by replacing the decoder of a masked auto-encoder with a completely simplified Bag-of-Word prediction task.
Our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters.
arXiv Detail & Related papers (2024-01-20T15:02:33Z) - Unlimiformer: Long-Range Transformers with Unlimited Length Input [67.04942180004805]
Unlimiformer is a general approach that wraps any existing pretrained encoder-decoder transformer.
It offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index.
We show that Unlimiformer can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time (a toy sketch of this kNN cross-attention lookup appears after this list).
arXiv Detail & Related papers (2023-05-02T17:35:08Z) - Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration.
CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root.
We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and TriviaQA passage retrieval, finding that CAPOT has a similar impact to data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z) - Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which in contrast optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z) - KRNet: Towards Efficient Knowledge Replay [50.315451023983805]
The knowledge replay technique has been widely used in many tasks such as continual learning and continuous domain adaptation.
We propose a novel and efficient knowledge recording network (KRNet) which directly maps an arbitrary sample identity number to the corresponding datum.
Our KRNet requires significantly less storage cost for the latent codes and can be trained without the encoder sub-network.
arXiv Detail & Related papers (2022-05-23T08:34:17Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model using document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z) - Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization [96.73647162960842]
Temporal action localization (TAL) is a fundamental yet challenging task in video understanding.
Existing TAL methods rely on pre-training a video encoder through action classification supervision.
We introduce a novel low-fidelity end-to-end (LoFi) video encoder pre-training method.
arXiv Detail & Related papers (2021-03-28T22:18:14Z) - Plug and Play Autoencoders for Conditional Text Generation [0.0]
We propose a method where any pretrained autoencoder can be used to train an embedding-to-embedding mapping.
This reduces the need for labeled training data for the task and makes the training procedure more efficient.
We show that our method performs better than or comparable to strong baselines while being up to four times faster.
arXiv Detail & Related papers (2020-10-06T19:18:06Z)
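For contrast with the retrieval-style alternative summarized in the Unlimiformer entry above, here is a toy sketch of a kNN cross-attention lookup. The function name, exact top-k retrieval, and tensor shapes are illustrative assumptions, not the Unlimiformer code; the actual system uses an approximate nearest-neighbour index over the encoder states.

```python
# Toy sketch (assumed interface): cross-attention that retrieves only the
# top-k encoder states per decoder query instead of attending over the full input.
import torch

def knn_cross_attention(query, encoder_states, k=16):
    # query: [B, H] decoder hidden state; encoder_states: [B, N, H], N possibly very large.
    scores = torch.einsum("bh,bnh->bn", query, encoder_states)   # score all cached states
    top_scores, top_idx = scores.topk(k, dim=-1)                 # keep only the k best matches
    idx = top_idx.unsqueeze(-1).expand(-1, -1, encoder_states.size(-1))
    top_states = torch.gather(encoder_states, 1, idx)            # [B, k, H] retrieved states
    attn = torch.softmax(top_scores, dim=-1)                     # attend only over k states
    return torch.einsum("bk,bkh->bh", attn, top_states)

# Example: 500k cached "encoder states", but only 16 participate in each attention step.
q = torch.randn(1, 64)
mem = torch.randn(1, 500_000, 64)
print(knn_cross_attention(q, mem).shape)   # torch.Size([1, 64])
```

A real retrieval-augmented decoder would replace the exhaustive scoring above with an approximate index (e.g., FAISS) so that neither memory nor compute scales with the full input length at each decoding step.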
This list is automatically generated from the titles and abstracts of the papers on this site.