Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
- URL: http://arxiv.org/abs/2212.09912v2
- Date: Tue, 24 Oct 2023 20:59:33 GMT
- Title: Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
- Authors: Kaiser Sun, Peng Qi, Yuhao Zhang, Lan Liu, William Yang Wang, Zhiheng Huang
- Abstract summary: We identify the issue of tokenization inconsistency that is commonly neglected in training generative models.
This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently.
We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets.
- Score: 54.306234256074255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency that is commonly neglected in training these models. This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently by the tokenizer, and thus leads to performance drops as well as hallucination. We propose a simple yet effective fix to this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F2 when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization should be handled when solving extractive tasks, and we recommend applying consistent tokenization during training.
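To make the inconsistency concrete, the sketch below shows how a gold answer tokenized on its own can receive different subword IDs than the same span inside the tokenized input, and one simple way to keep them consistent: select the answer's token IDs from the tokenized input via character offsets rather than re-tokenizing the answer. This is an illustrative reconstruction of the general idea, not the authors' released code; it assumes the Hugging Face transformers library with a fast BART tokenizer, and the example strings are made up.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and a
# fast BART tokenizer; strings and variable names are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

context = "The company headquarters are located in New York City."
answer = "New York City"
char_start = context.index(answer)
char_end = char_start + len(answer)

# Inconsistent: tokenizing the answer on its own. Without its in-context
# leading space, the first word can map to different subword IDs than the
# same word inside the context.
standalone_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

# Consistent: tokenize the context once, then keep the tokens whose
# character offsets overlap the answer span, so the target IDs are exactly
# the ones that appear in the encoded input.
enc = tokenizer(context, return_offsets_mapping=True, add_special_tokens=False)
consistent_ids = [
    tok_id
    for tok_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"])
    if start < char_end and end > char_start
]

print(tokenizer.convert_ids_to_tokens(standalone_ids))
print(tokenizer.convert_ids_to_tokens(consistent_ids))
```

With BART's byte-level BPE, the standalone answer typically begins with a token like "New", while the in-context span begins with "ĠNew" (the variant carrying a leading space), so a model trained on the standalone target is asked to generate token IDs that never occur in its tokenized input.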
Related papers
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs).
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging).
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance (a toy sketch of this coefficient-weighted merging appears after this list).
arXiv Detail & Related papers (2023-10-04T04:26:33Z)
- Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions [10.621564997491808]
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models.
We investigate whether CoT prompting affects the relative importance these models assign to particular input tokens.
Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt, it increases the robustness of saliency scores to question perturbations and variations in model output.
arXiv Detail & Related papers (2023-07-25T08:51:30Z)
- RF+clust for Leave-One-Problem-Out Performance Prediction [0.9281671380673306]
We study leave-one-problem-out (LOPO) performance prediction.
We analyze whether standard random forest (RF) model predictions can be improved by calibrating them with a weighted average of performance values.
arXiv Detail & Related papers (2023-01-23T16:14:59Z)
- DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization [75.72231742114951]
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks.
These models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency.
We propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
arXiv Detail & Related papers (2022-03-21T18:04:25Z)
- Question-Based Salient Span Selection for More Controllable Text Summarization [67.68208237480646]
We propose a method for incorporating question-answering (QA) signals into a summarization model.
Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs.
This QA-based signal is incorporated into a two-stage summarization model which first marks salient NPs in the input document using a classification model, then conditionally generates a summary.
arXiv Detail & Related papers (2021-11-15T17:36:41Z)
- Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models [19.21465581259624]
Many problems can be attributed to models exploiting spurious correlations, or shortcuts between the training data and the task labels.
In this paper, we aim to automatically identify such spurious correlations in NLP models at scale.
We show that our proposed method can effectively and efficiently identify a scalable set of "shortcuts", and mitigating these leads to more robust models in multiple applications.
arXiv Detail & Related papers (2021-10-14T21:40:03Z)
- Paired Examples as Indirect Supervision in Latent Decision Models [109.76417071249945]
We introduce a way to leverage paired examples that provide stronger cues for learning latent decisions.
We apply our method to improve compositional question answering using neural module networks on the DROP dataset.
arXiv Detail & Related papers (2021-04-05T03:58:30Z)
- FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale Generation [19.73842483996047]
We develop FiD-Ex, which addresses shortcomings of seq2seq models by introducing sentence markers to eliminate explanation fabrication.
FiD-Ex significantly improves over prior work in terms of explanation metrics and task accuracy, on multiple tasks from the ERASER explainability benchmark.
arXiv Detail & Related papers (2020-12-31T07:22:15Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
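As referenced in the AdaMerging entry above, the toy, self-contained sketch below illustrates task-wise coefficient-weighted model merging: task vectors are combined as theta = theta_0 + sum_k lambda_k * (theta_k - theta_0), and the scalar coefficients are tuned with a label-free objective (entropy minimization on unlabeled inputs, consistent with the summary's "without relying on the original training data"). This is an illustrative reconstruction in plain PyTorch with toy linear heads, not the paper's implementation; all names and constants are made up.

```python
# Toy sketch of task-wise coefficient-weighted model merging with a
# label-free entropy-minimization objective; an illustrative reconstruction,
# not the AdaMerging authors' implementation. Requires PyTorch.
import torch

torch.manual_seed(0)
dim, num_classes, num_tasks = 16, 4, 3

# Stand-ins for a pretrained linear head and per-task fine-tuned heads.
pretrained_w = torch.randn(num_classes, dim)
finetuned_ws = [pretrained_w + 0.1 * torch.randn(num_classes, dim)
                for _ in range(num_tasks)]

# One learnable merging coefficient per task vector (task-wise variant).
coeffs = torch.nn.Parameter(torch.full((num_tasks,), 0.3))
optimizer = torch.optim.Adam([coeffs], lr=1e-2)

unlabeled_x = torch.randn(128, dim)  # stand-in for unlabeled test inputs

for _ in range(200):
    # theta = theta_0 + sum_k lambda_k * (theta_k - theta_0)
    merged_w = pretrained_w + sum(
        c * (w - pretrained_w) for c, w in zip(coeffs, finetuned_ws)
    )
    probs = (unlabeled_x @ merged_w.T).softmax(dim=-1)
    # Encourage confident predictions on unlabeled data (no labels needed).
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()

print("learned merging coefficients:", [round(c, 3) for c in coeffs.tolist()])
```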