Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
- URL: http://arxiv.org/abs/2212.09912v2
- Date: Tue, 24 Oct 2023 20:59:33 GMT
- Title: Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
- Authors: Kaiser Sun, Peng Qi, Yuhao Zhang, Lan Liu, William Yang Wang, Zhiheng Huang
- Abstract summary: We identify the issue of tokenization inconsistency that is commonly neglected in training generative models.
This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently.
We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets.
- Score: 54.306234256074255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency that is commonly neglected in training these models. This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently by the tokenizer, and thus leads to performance drops as well as hallucination. We propose a simple yet effective fix to this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F2 when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization should be handled when solving extractive tasks, and we recommend applying consistent tokenization during training.
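To make the inconsistency concrete, the sketch below shows how a gold answer tokenized on its own can receive different subword IDs than the same span inside the tokenized input, and one simple way to keep them consistent: select the answer's token IDs from the tokenized input via character offsets rather than re-tokenizing the answer. This is an illustrative reconstruction of the general idea, not the authors' released code; it assumes the Hugging Face transformers library with a fast BART tokenizer, and the example strings are made up.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and a
# fast BART tokenizer; strings and variable names are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

context = "The company headquarters are located in New York City."
answer = "New York City"
char_start = context.index(answer)
char_end = char_start + len(answer)

# Inconsistent: tokenizing the answer on its own. Without its in-context
# leading space, the first word can map to different subword IDs than the
# same word inside the context.
standalone_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

# Consistent: tokenize the context once, then keep the tokens whose
# character offsets overlap the answer span, so the target IDs are exactly
# the ones that appear in the encoded input.
enc = tokenizer(context, return_offsets_mapping=True, add_special_tokens=False)
consistent_ids = [
    tok_id
    for tok_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"])
    if start < char_end and end > char_start
]

print(tokenizer.convert_ids_to_tokens(standalone_ids))
print(tokenizer.convert_ids_to_tokens(consistent_ids))
```

With BART's byte-level BPE, the standalone answer typically begins with a token like "New", while the in-context span begins with "ĠNew" (the variant carrying a leading space), so a model trained on the standalone target is asked to generate token IDs that never occur in its tokenized input.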
Related papers
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs).
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance (a toy sketch of this coefficient-weighted merging appears after this list).
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions [10.621564997491808]
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models.
We investigate whether CoT prompting affects the relative importance these models assign to particular input tokens.
Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt, it increases the robustness of saliency scores to question perturbations and variations in model output.
arXiv Detail & Related papers (2023-07-25T08:51:30Z)
- RF+clust for Leave-One-Problem-Out Performance Prediction [0.9281671380673306]
We study leave-one-problem-out (LOPO) performance prediction.
We analyze whether standard random forest (RF) model predictions can be improved by calibrating them with a weighted average of performance values.
arXiv Detail & Related papers (2023-01-23T16:14:59Z)
- DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization [75.72231742114951]
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks.
These models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency.
We propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
arXiv Detail & Related papers (2022-03-21T18:04:25Z)
- Question-Based Salient Span Selection for More Controllable Text Summarization [67.68208237480646]
We propose a method for incorporating question-answering (QA) signals into a summarization model.
Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs.
This QA-based signal is incorporated into a two-stage summarization model which first marks salient NPs in the input document using a classification model, then conditionally generates a summary.
arXiv Detail & Related papers (2021-11-15T17:36:41Z)
- Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models [19.21465581259624]
Many problems can be attributed to models exploiting spurious correlations, or shortcuts between the training data and the task labels.
In this paper, we aim to automatically identify such spurious correlations in NLP models at scale.
We show that our proposed method can effectively and efficiently identify a scalable set of "shortcuts", and mitigating these leads to more robust models in multiple applications.
arXiv Detail & Related papers (2021-10-14T21:40:03Z)
- Paired Examples as Indirect Supervision in Latent Decision Models [109.76417071249945]
We introduce a way to leverage paired examples that provide stronger cues for learning latent decisions.
We apply our method to improve compositional question answering using neural module networks on the DROP dataset.
arXiv Detail & Related papers (2021-04-05T03:58:30Z)
- FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale Generation [19.73842483996047]
We develop FiD-Ex, which addresses shortcomings of seq2seq models by introducing sentence markers to eliminate explanation fabrication.
FiD-Ex significantly improves over prior work in terms of explanation metrics and task accuracy, on multiple tasks from the ERASER explainability benchmark.
arXiv Detail & Related papers (2020-12-31T07:22:15Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
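As referenced in the AdaMerging entry above, the toy, self-contained sketch below illustrates task-wise coefficient-weighted model merging: task vectors are combined as theta = theta_0 + sum_k lambda_k * (theta_k - theta_0), and the scalar coefficients are tuned with a label-free objective (entropy minimization on unlabeled inputs, consistent with the summary's "without relying on the original training data"). This is an illustrative reconstruction in plain PyTorch with toy linear heads, not the paper's implementation; all names and constants are made up.

```python
# Toy sketch of task-wise coefficient-weighted model merging with a
# label-free entropy-minimization objective; an illustrative reconstruction,
# not the AdaMerging authors' implementation. Requires PyTorch.
import torch

torch.manual_seed(0)
dim, num_classes, num_tasks = 16, 4, 3

# Stand-ins for a pretrained linear head and per-task fine-tuned heads.
pretrained_w = torch.randn(num_classes, dim)
finetuned_ws = [pretrained_w + 0.1 * torch.randn(num_classes, dim)
                for _ in range(num_tasks)]

# One learnable merging coefficient per task vector (task-wise variant).
coeffs = torch.nn.Parameter(torch.full((num_tasks,), 0.3))
optimizer = torch.optim.Adam([coeffs], lr=1e-2)

unlabeled_x = torch.randn(128, dim)  # stand-in for unlabeled test inputs

for _ in range(200):
    # theta = theta_0 + sum_k lambda_k * (theta_k - theta_0)
    merged_w = pretrained_w + sum(
        c * (w - pretrained_w) for c, w in zip(coeffs, finetuned_ws)
    )
    probs = (unlabeled_x @ merged_w.T).softmax(dim=-1)
    # Encourage confident predictions on unlabeled data (no labels needed).
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()

print("learned merging coefficients:", [round(c, 3) for c in coeffs.tolist()])
```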