Addressing the Length Bias Problem in Document-Level Neural Machine Translation
- URL: http://arxiv.org/abs/2311.11601v1
- Date: Mon, 20 Nov 2023 08:29:52 GMT
- Title: Addressing the Length Bias Problem in Document-Level Neural Machine Translation
- Authors: Zhuocheng Zhang, Shuhao Gu, Min Zhang, Yang Feng
- Abstract summary: Document-level neural machine translation (DNMT) has shown promising results by incorporating more context information.
However, DNMT suffers from significant translation quality degradation when decoding documents that are much shorter or longer than the maximum sequence length seen during training.
We propose to improve the DNMT model in terms of its training method, attention mechanism, and decoding strategy.
- Score: 29.590471092149375
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Document-level neural machine translation (DNMT) has shown promising results
by incorporating more context information. However, this approach also
introduces a length bias problem, whereby DNMT suffers from significant
translation quality degradation when decoding documents that are much shorter
or longer than the maximum sequence length during training. To solve the length
bias problem, we propose to improve the DNMT model in terms of its training
method, attention mechanism, and decoding strategy. Firstly,
we propose to sample the training data dynamically to ensure a more uniform
distribution across different sequence lengths. Then, we introduce a
length-normalized attention mechanism to aid the model in focusing on target
information, mitigating the issue of attention divergence when processing
longer sequences. Lastly, we propose a sliding window strategy during decoding
that integrates as much context information as possible without exceeding the
maximum sequence length. The experimental results indicate that our method can
bring significant improvements on several open datasets, and further analysis
shows that our method can significantly alleviate the length bias problem.
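To make the attention-related ingredient concrete, below is a minimal sketch of a length-normalized scaled dot-product attention, assuming the normalization rescales the attention logits by the log of the key length so the softmax stays comparably peaked as the context grows. The exact normalization used by the authors may differ; the function name, tensor shapes, and scaling rule here are illustrative assumptions, not the released implementation.

```python
import math
import torch


def length_normalized_attention(q, k, v, mask=None):
    """Scaled dot-product attention with a log-length rescaling of the logits.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    mask:    optional boolean tensor broadcastable to (batch, heads, q_len, k_len),
             True at positions that must be ignored.
    """
    d_k = q.size(-1)
    k_len = k.size(-2)
    # Standard 1/sqrt(d_k) scaling, multiplied by log(k_len) so attention does
    # not flatten out on long sequences (a hypothetical stand-in for the
    # paper's length-normalized attention).
    scale = math.log(max(k_len, 2)) / math.sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)


if __name__ == "__main__":
    # Tiny smoke test with random tensors.
    q = torch.randn(2, 4, 16, 32)
    k = torch.randn(2, 4, 16, 32)
    v = torch.randn(2, 4, 16, 32)
    print(length_normalized_attention(q, k, v).shape)  # torch.Size([2, 4, 16, 32])
```

The dynamic length-balanced sampling and the sliding-window decoding described in the abstract are orthogonal to this snippet and would sit in the data loader and the inference loop, respectively.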
Related papers
- Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement [62.87020831987625]
We propose a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations.
We select the most challenging samples as the influential data to effectively frame the long-range dependencies.
Experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
arXiv Detail & Related papers (2024-10-21T04:30:53Z)
- Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension [21.729875191721984]
We introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention.
We also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions.
Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length.
arXiv Detail & Related papers (2024-10-05T15:59:32Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z)
- Latent-based Diffusion Model for Long-tailed Recognition [10.410057703866899]
Long-tailed imbalance distribution is a common issue in practical computer vision applications.
We propose a new approach, the Latent-based Diffusion Model for Long-tailed Recognition (LDMLR), as a feature augmentation method to tackle the issue.
Using the proposed method, the model's accuracy improves on the CIFAR-LT and ImageNet-LT datasets.
arXiv Detail & Related papers (2024-04-06T06:15:07Z)
- Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences; a toy example of this prompt layout appears after this list.
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
- AdAUC: End-to-end Adversarial AUC Optimization Against Long-tail Problems [102.95119281306893]
We present an early trial to explore adversarial training methods to optimize AUC.
We reformulate the AUC optimization problem as a saddle point problem, where the objective becomes an instance-wise function.
Our analysis differs from the existing studies since the algorithm is asked to generate adversarial examples by calculating the gradient of a min-max problem.
arXiv Detail & Related papers (2022-06-24T09:13:39Z)
- Sequence Length is a Domain: Length-based Overfitting in Transformer Models [0.0]
In machine translation, neural-based systems perform worse on very long sequences than the preceding phrase-based translation approaches.
We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z)
- Short-Term Memory Optimization in Recurrent Neural Networks by Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z)
- Longitudinal Deep Kernel Gaussian Process Regression [16.618767289437905]
We introduce Longitudinal Deep Kernel Gaussian Process Regression (L-DKGPR).
L-DKGPR automates the discovery of complex multilevel correlation structure from longitudinal data.
We derive an efficient algorithm to train L-DKGPR using latent space inducing points and variational inference.
arXiv Detail & Related papers (2020-05-24T15:10:48Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
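The following toy example, which is not taken from any of the papers above, illustrates the prompt layout described in "Instruction Position Matters in Sequence Generation with Large Language Models": the task instruction is placed after the source text rather than before it. The function name and prompt wording are illustrative assumptions.

```python
def build_prompt(source_text: str, instruction: str, instruction_last: bool = True) -> str:
    """Assemble a single prompt string for a conditional generation task."""
    if instruction_last:
        # Post-instruction layout: the model reads the full input first and
        # sees the instruction immediately before it starts generating.
        return f"{source_text}\n\n{instruction}\n"
    # Conventional layout: instruction first, then the input.
    return f"{instruction}\n\n{source_text}\n"


if __name__ == "__main__":
    src = "Der Bericht wurde gestern veröffentlicht."
    inst = "Translate the German text into English."
    print(build_prompt(src, inst, instruction_last=True))
    print(build_prompt(src, inst, instruction_last=False))
```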