Addressing the Length Bias Problem in Document-Level Neural Machine Translation
- URL: http://arxiv.org/abs/2311.11601v1
- Date: Mon, 20 Nov 2023 08:29:52 GMT
- Title: Addressing the Length Bias Problem in Document-Level Neural Machine
  Translation
- Authors: Zhuocheng Zhang, Shuhao Gu, Min Zhang, Yang Feng
- Abstract summary: Document-level neural machine translation (DNMT) has shown promising results by incorporating more context information.
DNMT suffers from significant translation quality degradation when decoding documents that are much shorter or longer than the maximum sequence length.
We propose to improve the DNMT model in training method, attention mechanism, and decoding strategy.
- Score: 29.590471092149375
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract:   Document-level neural machine translation (DNMT) has shown promising results
by incorporating more context information. However, this approach also
introduces a length bias problem, whereby DNMT suffers from significant
translation quality degradation when decoding documents that are much shorter
or longer than the maximum sequence length during training. To solve the
length bias problem, we propose to improve the DNMT
model in training method, attention mechanism, and decoding strategy. Firstly,
we propose to sample the training data dynamically to ensure a more uniform
distribution across different sequence lengths. Then, we introduce a
length-normalized attention mechanism to aid the model in focusing on target
information, mitigating the issue of attention divergence when processing
longer sequences. Lastly, we propose a sliding window strategy during decoding
that integrates as much context information as possible without exceeding the
maximum sequence length. The experimental results indicate that our method can
bring significant improvements on several open datasets, and further analysis
shows that our method can significantly alleviate the length bias problem.
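
The sliding window decoding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_windows`, the whitespace token count, and the greedy choice of preceding sentences are all assumptions made for illustration. It only shows how each sentence might be paired with as much preceding context as fits under a maximum sequence length.

```python
from typing import List


def build_windows(sentences: List[str], max_len: int) -> List[List[str]]:
    """For each sentence, collect the sentence plus as many immediately
    preceding sentences as fit within max_len tokens.

    Assumption: token length is approximated by whitespace splitting;
    a real system would use the model's own tokenizer.
    """
    windows = []
    for i, sent in enumerate(sentences):
        window = [sent]
        # Remaining token budget after the current sentence is included.
        budget = max_len - len(sent.split())
        j = i - 1
        # Walk backwards, adding context until the next sentence no longer fits.
        while j >= 0 and len(sentences[j].split()) <= budget:
            window.insert(0, sentences[j])
            budget -= len(sentences[j].split())
            j -= 1
        windows.append(window)
    return windows


if __name__ == "__main__":
    doc = ["a b", "c d e", "f", "g h i j"]
    for w in build_windows(doc, max_len=5):
        print(w)
```

In this sketch a hypothetical translation model would decode each window and keep only the translation of its last sentence, so every sentence is translated with context while the input never exceeds the length seen in training.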
 
      
        Related papers
        - Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
 Diffusion Large Language Models offer efficient parallel generation and capable global modeling.
The dominant application of DLLMs is hindered by the need for a statically predefined generation length.
We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
 arXiv  Detail & Related papers  (2025-08-01T17:56:07Z)
- Context-aware Biases for Length Extrapolation [0.0]
 We propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE).
By dynamically adjusting positional biases based on the input sequence, CABLE overcomes the rigidity of fixed RPEs.
Our method significantly enhances the performance of existing RPE methods tested on the FineWeb-Edu10B and WikiText-103 datasets.
 arXiv  Detail & Related papers  (2025-03-11T05:54:58Z)
- Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling [26.310612987107813]
 Large language models suffer from the "lost-in-the-middle" problem, where crucial information in the middle of the context is often underrepresented or lost.
We propose a layer-specific positional encoding scaling method that assigns distinct scaling factors to each layer.
Our approach results in an average accuracy improvement of up to 20% on the Key-Value Retrieval dataset.
 arXiv  Detail & Related papers  (2025-03-06T11:59:55Z)
- Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement [62.87020831987625]
 We propose a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations.
We select the most challenging samples as the influential data to effectively frame the long-range dependencies.
Experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
 arXiv  Detail & Related papers  (2024-10-21T04:30:53Z)
- Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension [21.729875191721984]
 We introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention.
We also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions.
Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length.
 arXiv  Detail & Related papers  (2024-10-05T15:59:32Z)
- Beyond Fixed Length: Bucket Pre-training is All You Need [27.273944625005377]
 We propose a novel multi-bucket data composition method that transcends the fixed-length paradigm.
Our approach adaptively organizes training data to achieve optimal composition quality as measured by the proposed metrics.
 arXiv  Detail & Related papers  (2024-07-10T09:27:23Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
 We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
 arXiv  Detail & Related papers  (2024-06-17T18:34:58Z)
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
 We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
 Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
 arXiv  Detail & Related papers  (2024-05-21T22:26:01Z)
- Latent-based Diffusion Model for Long-tailed Recognition [10.410057703866899]
 Long-tailed imbalance distribution is a common issue in practical computer vision applications.
We propose a new approach, the Latent-based Diffusion Model for Long-tailed Recognition (LDMLR) as a feature augmentation method to tackle the issue.
The model's accuracy shows an improvement on the CIFAR-LT and ImageNet-LT datasets by using the proposed method.
 arXiv  Detail & Related papers  (2024-04-06T06:15:07Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
 We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
 arXiv  Detail & Related papers  (2023-09-27T21:41:49Z)
- Instruction Position Matters in Sequence Generation with Large Language
  Models [67.87516654892343]
 Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
 arXiv  Detail & Related papers  (2023-08-23T12:36:57Z)
- AdAUC: End-to-end Adversarial AUC Optimization Against Long-tail
  Problems [102.95119281306893]
 We present an early trial to explore adversarial training methods to optimize AUC.
We reformulate the AUC optimization problem as a saddle point problem, where the objective becomes an instance-wise function.
Our analysis differs from the existing studies since the algorithm is asked to generate adversarial examples by calculating the gradient of a min-max problem.
 arXiv  Detail & Related papers  (2022-06-24T09:13:39Z)
- Sequence Length is a Domain: Length-based Overfitting in Transformer
  Models [0.0]
 In machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches.
We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
 arXiv  Detail & Related papers  (2021-09-15T13:25:19Z)
- Short-Term Memory Optimization in Recurrent Neural Networks by
  Autoencoder-based Initialization [79.42778415729475]
 We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
 arXiv  Detail & Related papers  (2020-11-05T14:57:16Z)
- Longitudinal Deep Kernel Gaussian Process Regression [16.618767289437905]
 We introduce Longitudinal deep kernel process regression (L-DKGPR).
L-DKGPR automates the discovery of complex multilevel correlation structure from longitudinal data.
We derive an efficient algorithm to train L-DKGPR using latent space inducing points and variational inference.
 arXiv  Detail & Related papers  (2020-05-24T15:10:48Z)
- Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
 Communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new distributed error feedback (DEF) algorithm, which shows better convergence than error feedback for non-efficient distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better bounds than DEF.
 arXiv  Detail & Related papers  (2020-04-11T03:50:59Z)