TwistList: Resources and Baselines for Tongue Twister Generation
- URL: http://arxiv.org/abs/2306.03457v2
- Date: Wed, 7 Jun 2023 05:24:25 GMT
- Title: TwistList: Resources and Baselines for Tongue Twister Generation
- Authors: Tyler Loakman, Chen Tang and Chenghua Lin
- Abstract summary: We present work on the generation of tongue twisters, a form of language that is required to be phonetically conditioned to maximise sound overlap.
We present TwistList, a large annotated dataset of tongue twisters, consisting of 2.1K+ human-authored examples.
We additionally present several benchmark systems for the proposed task of tongue twister generation, including models that both do and do not require training on in-domain data.
- Score: 17.317550526263183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous work in phonetically-grounded language generation has mainly focused
on domains such as lyrics and poetry. In this paper, we present work on the
generation of tongue twisters - a form of language that is required to be
phonetically conditioned to maximise sound overlap, whilst maintaining semantic
consistency with an input topic, and still being grammatically correct. We
present TwistList, a large annotated dataset of tongue twisters,
consisting of 2.1K+ human-authored examples. We additionally present several
benchmark systems (referred to as TwisterMisters) for the proposed task of
tongue twister generation, including models that both do and do not require
training on in-domain data. We present the results of automatic and human
evaluation to demonstrate the performance of existing mainstream pre-trained
models in this task with limited (or no) task specific training and data, and
no explicit phonetic knowledge. We find that the task of tongue twister
generation is challenging for models under these conditions, yet some models
are still capable of generating acceptable examples of this language type.
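As an illustration of what "maximising sound overlap" can mean in practice, the sketch below scores a candidate sentence by how often its words begin with the same phoneme. This is a minimal, assumed proxy metric using the CMU Pronouncing Dictionary via the `pronouncing` package; it is not the evaluation protocol or scoring used in the paper.
```python
# Minimal sketch (not from the paper): score "sound overlap" as the fraction
# of word pairs in a candidate tongue twister that share an initial phoneme.
# Relies on the CMU Pronouncing Dictionary via the `pronouncing` package.
import itertools

import pronouncing


def initial_phoneme(word):
    """Return the first ARPAbet phoneme of `word` (stress digits stripped), or None if unknown."""
    phones = pronouncing.phones_for_word(word.lower())
    return phones[0].split()[0].rstrip("012") if phones else None


def sound_overlap(text):
    """Fraction of word pairs whose words start with the same phoneme."""
    phonemes = [p for p in (initial_phoneme(w) for w in text.split()) if p]
    pairs = list(itertools.combinations(phonemes, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # A classic tongue twister should score noticeably higher than plain prose.
    print(sound_overlap("she sells sea shells by the sea shore"))
    print(sound_overlap("the cat sat quietly near an open window"))
```
A generator for this task would need to trade such a phonetic-overlap signal off against topical relevance and grammaticality, which is what makes the task non-trivial for standard pre-trained models.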
Related papers
- Train & Constrain: Phonologically Informed Tongue-Twister Generation from Topics and Paraphrases [24.954896926774627]
We present a pipeline for generating phonologically informed tongue twisters from large language models (LLMs).
We show the results of automatic and human evaluation of smaller models trained on our generated dataset.
We introduce a phoneme-aware constrained decoding module (PACD) that can be integrated into an autoregressive language model.
arXiv Detail & Related papers (2024-03-20T18:13:17Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Leveraging Natural Supervision for Language Representation Learning and Generation [8.083109555490475]
We describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision.
We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks.
We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations.
arXiv Detail & Related papers (2022-07-21T17:26:03Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data yields GLUE performance close to that obtained by pre-training on a non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- Few-Shot Text Generation with Pattern-Exploiting Training [12.919486518128734]
In this paper, we show that the underlying idea can also be applied to text generation tasks.
We adapt Pattern-Exploiting Training (PET), a recently proposed few-shot approach, for finetuning generative language models on text generation tasks.
arXiv Detail & Related papers (2020-12-22T10:53:07Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- QURIOUS: Question Generation Pretraining for Text Generation [13.595014409069584]
We propose question generation as a pretraining method, which better aligns with the text generation objectives.
Our text generation models pretrained with this method are better at understanding the essence of the input and are better language models for the target task.
arXiv Detail & Related papers (2020-04-23T08:41:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.