A Benchmark for Text Expansion: Datasets, Metrics, and Baselines
- URL: http://arxiv.org/abs/2309.09198v1
- Date: Sun, 17 Sep 2023 07:54:38 GMT
- Title: A Benchmark for Text Expansion: Datasets, Metrics, and Baselines
- Authors: Yi Chen, Haiyun Jiang, Wei Bi, Rui Wang, Longyue Wang, Shuming Shi,
Ruifeng Xu
- Abstract summary: This work presents a new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations of plain text.
We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references.
On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which outperform the Text2Text baselines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a new task of Text Expansion (TE), which aims to insert
fine-grained modifiers into proper locations of the plain text to concretize or
vivify human writings. Different from existing insertion-based writing
assistance tasks, TE requires the model to be more flexible in both locating
and generation, and also more cautious in keeping basic semantics. We leverage
four complementary approaches to construct a dataset with 12 million
automatically generated instances and 2K human-annotated references for both
English and Chinese. To facilitate automatic evaluation, we design various
metrics from multiple perspectives. In particular, we propose Info-Gain to
effectively measure the informativeness of expansions, which is an important
quality dimension in TE. On top of a pre-trained text-infilling model, we build
both pipelined and joint Locate&Infill models, which outperform the
Text2Text baselines, especially in expansion
informativeness. Experiments verify the feasibility of the TE task and point
out potential directions for future research toward better automatic text
expansion.
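The abstract names Info-Gain as a metric for the informativeness of an expansion but does not specify its formulation. As a hypothetical illustration of the idea only, and not the paper's actual metric, the sketch below scores the tokens an expansion inserts by their surprisal under a toy Laplace-smoothed unigram language model: rare inserted modifiers contribute more information than duplicated frequent tokens.

```python
import math
from collections import Counter

def unigram_logprobs(corpus_tokens):
    """Laplace-smoothed unigram log2-probabilities estimated from a token list."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    def logp(tok):
        return math.log2((counts.get(tok, 0) + 1) / (total + vocab))
    return logp

def info_gain(original, expanded, logp):
    """Sum the surprisal (-log p) of the tokens the expansion adds."""
    added = Counter(expanded.lower().split()) - Counter(original.lower().split())
    return sum(-logp(tok) * n for tok, n in added.items())

# Tiny reference corpus; in practice this would be a large text collection.
corpus = "the cat sat on the mat the dog sat on the rug".split()
logp = unigram_logprobs(corpus)

plain = "the cat sat on the mat"
common = "the the cat sat on the mat"    # duplicates a frequent token
vivid = "the ginger cat sat on the mat"  # inserts a rare modifier
assert info_gain(plain, vivid, logp) > info_gain(plain, common, logp)
```

Under this reading, the rare modifier "ginger" scores higher than an extra "the", matching the intuition that informative expansions add low-probability content.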
Related papers
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating its effectiveness.
arXiv Detail & Related papers (2023-11-22T06:46:37Z)
- Automatic and Human-AI Interactive Text Generation [27.05024520190722]
This tutorial aims to provide an overview of the state-of-the-art natural language generation research.
Text-to-text generation tasks are more constrained in terms of semantic consistency and targeted language styles.
arXiv Detail & Related papers (2023-10-05T20:26:15Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
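RAVEN's full suite spans multiple n-gram sizes and syntactic analyses; as a minimal sketch of the core idea only (function names are illustrative, not RAVEN's API), novelty can be read as the fraction of a generation's n-grams that never appear in the training data:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated, training, n=4):
    """Fraction of the generated text's n-grams absent from the training text."""
    gen = ngrams(generated.split(), n)
    if not gen:  # text shorter than n tokens yields no n-grams
        return 0.0
    return len(gen - ngrams(training.split(), n)) / len(gen)

train = "the quick brown fox jumps over the lazy dog"
assert novelty("the quick brown fox", train) == 0.0  # pure copy
assert novelty("a slow green turtle", train) == 1.0  # fully novel
```

A copied span scores 0.0 and a fully novel one 1.0; real generations fall in between, which is what such an analysis quantifies.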
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- GenAug: Data Augmentation for Finetuning Text Generators [21.96895115572357]
We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews.
Our experiments demonstrate that insertion of character-level synthetic noise and keyword replacement with hypernyms are effective augmentation methods.
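The two augmentations this summary names can be illustrated with a minimal sketch; the hard-coded hypernym map below stands in for a lexical resource such as WordNet, and none of the details reproduce the paper's actual setup:

```python
import random

# Toy hypernym map; entries are illustrative, not from the paper.
HYPERNYMS = {"pizza": "food", "waiter": "person", "coffee": "beverage"}

def char_noise(text, rate=0.1, rng=None):
    """Insert random lowercase letters after alphabetic characters at the
    given rate, simulating character-level synthetic noise (typos)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    out = []
    for ch in text:
        out.append(ch)
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
    return "".join(out)

def hypernym_replace(text):
    """Swap known words for their hypernyms, generalizing the text."""
    return " ".join(HYPERNYMS.get(w, w) for w in text.split())

review = "the waiter brought cold pizza"
print(hypernym_replace(review))      # prints: the person brought cold food
print(char_noise(review, rate=0.3))  # same sentence with injected typos
```

Both transforms keep the original label intact while perturbing the surface form, which is what makes them usable for finetuning-time augmentation.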
arXiv Detail & Related papers (2020-10-05T05:46:39Z)
- Text Data Augmentation: Towards better detection of spear-phishing emails [1.6556358263455926]
We propose a corpus and task augmentation framework to augment English texts within our company.
Our proposal combines different methods, utilizing the BERT language model, multi-step back-translation, and heuristics.
We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora.
arXiv Detail & Related papers (2020-07-04T07:45:04Z)