On-the-Fly Attention Modularization for Neural Generation
- URL: http://arxiv.org/abs/2101.00371v1
- Date: Sat, 2 Jan 2021 05:16:46 GMT
- Title: On-the-Fly Attention Modularization for Neural Generation
- Authors: Yue Dong, Chandra Bhagavatula, Ximing Lu, Jena D. Hwang, Antoine
Bosselut, Jackie Chi Kit Cheung, Yejin Choi
- Abstract summary: We show that generated text is repetitive, generic, self-inconsistent, and lacking commonsense.
Our findings motivate on-the-fly attention modularization, a simple but effective method for injecting inductive biases into attention during inference.
- Score: 54.912042110885366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite considerable advancements with deep neural language models (LMs),
neural text generation still suffers from degeneration: generated text is
repetitive, generic, self-inconsistent, and lacking commonsense. Empirical
analyses of sentence-level attention patterns reveal that neural text
degeneration may be associated with insufficient learning of inductive biases
by the attention mechanism. Our findings motivate on-the-fly attention
modularization, a simple but effective method for injecting inductive biases
into attention computation during inference. With attention modularization,
the language model generates text with enhanced diversity and commonsense
reasoning while maintaining fluency and coherence.
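The paper's code is not reproduced here; as a rough illustration of the idea, the sketch below injects a hypothetical sentence-level bias into attention logits at decoding time. The function name, the `same_sentence_mask` interface, and the additive-bias rule are assumptions, not the authors' exact modularization:
```python
import torch.nn.functional as F

def modularized_attention(scores, same_sentence_mask, bias=1.0):
    """Add a sentence-level inductive bias to attention logits at inference.

    scores:             (batch, heads, tgt_len, src_len) raw attention logits
    same_sentence_mask: (batch, tgt_len, src_len) bool; True where the key
                        token lies in the same sentence as the query token
    bias:               additive logit bonus for same-sentence positions
                        (hypothetical knob, not the paper's exact rule)
    """
    # Broadcast the mask across heads and nudge same-sentence logits upward.
    scores = scores + bias * same_sentence_mask.unsqueeze(1).to(scores.dtype)
    return F.softmax(scores, dim=-1)
```
Because the reweighting happens only inside inference-time attention calls, training is left untouched, which is what makes the approach "on-the-fly".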
Related papers
- Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies [7.21603206617401]
We show that larger GPT-2 models require a disproportionately larger share of attention heads to be masked/ablated before displaying degradation of a magnitude comparable to that caused by masking in smaller models.
These results suggest that the attention mechanism in transformer models may present an analogue to the notions of cognitive and brain reserve.
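As a rough sketch of the kind of head masking such an analysis performs (the function and ablation rule below are illustrative, not the study's code):
```python
import torch

def ablate_heads(attn_probs, heads_to_mask):
    """Zero out selected attention heads; an ablated head then contributes
    a zero context vector downstream. One simple form of head masking;
    the study's exact ablation protocol may differ.

    attn_probs: (batch, num_heads, tgt_len, src_len) attention weights
    """
    attn_probs = attn_probs.clone()
    attn_probs[:, list(heads_to_mask)] = 0.0
    return attn_probs

# e.g., mask 25% of a 12-head layer and re-run evaluation:
# probs = ablate_heads(probs, heads_to_mask=range(3))
```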
arXiv Detail & Related papers (2024-06-05T00:31:50Z)
- Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [91.14291142262262]
This work presents a straightforward and fundamental explanation from the data perspective.
Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data.
Our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.
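The paper's remedy applies at training time; a minimal sketch of one way to penalize in-context repetitions in the training loss, assuming a simple per-token reweighting (the function name, the `rep_weight` knob, and the exact penalty rule are illustrative, not the authors' code):
```python
import torch
import torch.nn.functional as F

def repetition_weighted_nll(logits, targets, pad_id=0, rep_weight=0.5):
    """Token-level cross-entropy that downweights target tokens which
    already occurred earlier in the same sequence.

    logits:     (batch, seq_len, vocab)
    targets:    (batch, seq_len)
    rep_weight: loss weight for repeated tokens (hypothetical knob)
    """
    nll = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none"
    )
    # Mark positions whose token appeared earlier in the same sequence.
    repeated = torch.zeros_like(targets, dtype=torch.bool)
    for t in range(1, targets.size(1)):
        repeated[:, t] = (targets[:, :t] == targets[:, t : t + 1]).any(dim=1)
    weights = torch.where(
        repeated, torch.full_like(nll, rep_weight), torch.ones_like(nll)
    )
    return (nll * weights).sum() / (targets != pad_id).sum()
```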
arXiv Detail & Related papers (2023-10-16T09:35:42Z)
- NeuroExplainer: Fine-Grained Attention Decoding to Uncover Cortical Development Patterns of Preterm Infants [73.85768093666582]
We propose an explainable geometric deep network dubbed NeuroExplainer.
NeuroExplainer is used to uncover altered infant cortical development patterns associated with preterm birth.
arXiv Detail & Related papers (2023-01-01T12:48:12Z)
- Demystifying Neural Language Models' Insensitivity to Word-Order [7.72780997900827]
We investigate the insensitivity of neural language models to word order by quantifying the effect of word-order perturbations.
We find that neural language models rely on the local ordering of tokens more than on their global ordering.
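As an illustration of how local and global order can be perturbed separately (window size and protocol are illustrative, not the paper's exact setup):
```python
import random

def shuffle_within_windows(tokens, window=3, rng=random):
    """Destroy LOCAL order: permute tokens inside each window while the
    windows themselves stay in place."""
    out = list(tokens)
    for i in range(0, len(out), window):
        chunk = out[i:i + window]
        rng.shuffle(chunk)
        out[i:i + window] = chunk
    return out

def shuffle_windows(tokens, window=3, rng=random):
    """Destroy GLOBAL order: permute whole windows while the token order
    inside each window stays intact."""
    chunks = [list(tokens[i:i + window]) for i in range(0, len(tokens), window)]
    rng.shuffle(chunks)
    return [tok for chunk in chunks for tok in chunk]
```
Comparing a model's perplexity under these two perturbations indicates which kind of ordering it relies on.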
arXiv Detail & Related papers (2021-07-29T13:34:20Z)
- Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision [44.32874972577682]
We investigate the extent to which neural models can reason about natural language rationales that explain model predictions.
We use pre-trained language models, neural knowledge models, and distant supervision from related tasks.
Our model shows promise at generating post-hoc rationales that explain why an inference is more or less likely given additional information.
arXiv Detail & Related papers (2020-12-14T23:50:20Z)
- On Long-Tailed Phenomena in Neural Machine Translation [50.65273145888896]
State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens.
We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation.
We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy.
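The sketch below assumes the Anti-Focal loss scales token-level cross-entropy by (1 + p)^gamma, mirroring focal loss's (1 - p)^gamma with the opposite bias; consult the paper for the exact formulation and choice of gamma:
```python
import torch.nn.functional as F

def anti_focal_loss(logits, targets, gamma=1.0, pad_id=0):
    """Reweighted cross-entropy with the opposite bias of focal loss:
    the factor (1 + p)^gamma grows with the target probability p, so
    confident tokens are emphasized and rare, hard tokens are not
    over-penalized. Sketch under the assumed form above.

    logits:  (batch, seq_len, vocab)
    targets: (batch, seq_len)
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_p_t = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p_t = log_p_t.exp()
    loss = -((1.0 + p_t) ** gamma) * log_p_t
    mask = (targets != pad_id).to(loss.dtype)
    return (loss * mask).sum() / mask.sum()
```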
arXiv Detail & Related papers (2020-10-10T07:00:57Z)
- Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence [48.765579605145454]
We propose to explicitly segment target text into fragment units and align them with their data correspondences.
The resulting architecture maintains the same expressive power as neural attention models.
On both E2E and WebNLG benchmarks, we show the proposed model consistently outperforms its neural attention counterparts.
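As a rough sketch of the segment-and-align idea, here is a dynamic program that picks the best split of a target sequence into fragments, each aligned to one data record. The `scores` interface is hypothetical; in the paper, segmentation and correspondence are learned jointly with generation rather than given:
```python
def best_segmentation(scores, T, R, max_len=5):
    """Highest-scoring split of a length-T target into fragments, each
    aligned to one of R data records. scores[(s, e, r)] is the score of
    aligning tokens [s, e) to record r (hypothetical interface).
    """
    NEG = float("-inf")
    best = [0.0] + [NEG] * T   # best[e]: best score covering tokens [0, e)
    back = [None] * (T + 1)
    for e in range(1, T + 1):
        for s in range(max(0, e - max_len), e):
            for r in range(R):
                cand = best[s] + scores.get((s, e, r), NEG)
                if cand > best[e]:
                    best[e], back[e] = cand, (s, r)
    # Trace back the chosen fragments and their aligned records.
    segs, e = [], T
    while e > 0:
        s, r = back[e]
        segs.append((s, e, r))
        e = s
    return list(reversed(segs))
```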
arXiv Detail & Related papers (2020-05-03T14:28:28Z)
- A Controllable Model of Grounded Response Generation [122.7121624884747]
Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process.
We propose a framework that we call controllable grounded response generation (CGRG).
We show that, using this framework, a transformer-based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
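As a loose sketch of an inductive attention constraint (the helper below is illustrative; CGRG's actual mechanism ties control phrases to their grounding passages in a more structured way):
```python
import torch

def control_phrase_mask(seq_len, control_spans):
    """Boolean attention mask letting every position attend to itself and
    to tokens inside designated control-phrase spans.

    control_spans: iterable of half-open (start, end) index pairs
    """
    mask = torch.eye(seq_len, dtype=torch.bool)
    for start, end in control_spans:
        mask[:, start:end] = True
    return mask

# e.g., restrict attention to two control phrases at positions [4, 7) and [12, 15):
# mask = control_phrase_mask(seq_len=32, control_spans=[(4, 7), (12, 15)])
```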
arXiv Detail & Related papers (2020-05-01T21:22:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.