Generating Full Length Wikipedia Biographies: The Impact of Gender Bias
on the Retrieval-Based Generation of Women Biographies
- URL: http://arxiv.org/abs/2204.05879v1
- Date: Tue, 12 Apr 2022 15:16:57 GMT
- Title: Generating Full Length Wikipedia Biographies: The Impact of Gender Bias
on the Retrieval-Based Generation of Women Biographies
- Authors: Angela Fan, Claire Gardent
- Abstract summary: We develop a model for English text that uses a retrieval mechanism to identify relevant supporting information on the web.
A cache-based pre-trained encoder-decoder is used to generate long-form biographies section by section, including citation information.
We analyze our generated text to understand how differences in available web evidence data affect generation.
- Score: 22.842874899794996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating factual, long-form text such as Wikipedia articles raises three
key challenges: how to gather relevant evidence, how to structure information
into well-formed text, and how to ensure that the generated text is factually
correct. We address these by developing a model for English text that uses a
retrieval mechanism to identify relevant supporting information on the web and
a cache-based pre-trained encoder-decoder to generate long-form biographies
section by section, including citation information. To assess the impact of
available web evidence on the output text, we compare the performance of our
approach when generating biographies about women (for which less information is
available on the web) vs. biographies generally. To this end, we curate a
dataset of 1,500 biographies about women. We analyze our generated text to
understand how differences in available web evidence data affect generation. We
evaluate the factuality, fluency, and quality of the generated texts using
automatic metrics and human evaluation. We hope that these techniques can be
used as a starting point for human writers, to aid in reducing the complexity
inherent in the creation of long-form, factual text.
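As a rough illustration of the pipeline described in the abstract (retrieve web evidence for each section, then generate the sections one at a time with a model that caches previously generated text and records citations), a minimal Python sketch follows. It uses hypothetical stand-ins (retrieve_evidence, generate_section, a toy EVIDENCE index) rather than the authors' actual retrieval module or cache-based encoder-decoder.

```python
# Minimal sketch (not the authors' code): retrieval + section-by-section
# generation with a cache of previously generated sections and citations.
# retrieve_evidence, generate_section, EVIDENCE and SECTIONS are hypothetical.

from dataclasses import dataclass, field

# Toy stand-in for web evidence, keyed by (subject, section heading).
EVIDENCE = {
    ("Ada Lovelace", "Early life"): [
        ("Born in London in 1815 ...", "https://example.org/source1"),
    ],
    ("Ada Lovelace", "Career"): [
        ("Collaborated with Charles Babbage on the Analytical Engine ...",
         "https://example.org/source2"),
    ],
}

SECTIONS = ["Early life", "Career", "Legacy"]


@dataclass
class Biography:
    subject: str
    # Each entry is (heading, generated text, list of citation URLs).
    sections: list = field(default_factory=list)


def retrieve_evidence(subject, heading):
    """Stand-in for the web retrieval module: return (passage, url) pairs."""
    return EVIDENCE.get((subject, heading), [])


def generate_section(heading, evidence, cache):
    """Stand-in for the cache-based encoder-decoder.

    A real model would condition on the retrieved passages and on the cache of
    previously generated sections; here the cache is only used to avoid
    repeating evidence that an earlier section already covered.
    """
    seen = " ".join(cache)
    passages = [p for p, _ in evidence if p not in seen]
    citations = [url for _, url in evidence]
    body = " ".join(passages) if passages else "(no new evidence retrieved)"
    return f"== {heading} ==\n{body}", citations


def write_biography(subject):
    bio = Biography(subject)
    cache = []  # grows as sections are generated, feeding earlier output forward
    for heading in SECTIONS:
        evidence = retrieve_evidence(subject, heading)
        text, cites = generate_section(heading, evidence, cache)
        bio.sections.append((heading, text, cites))
        cache.append(text)
    return bio


if __name__ == "__main__":
    for heading, text, cites in write_biography("Ada Lovelace").sections:
        print(text)
        print("citations:", cites)
```

In the real system the generation step is a pre-trained encoder-decoder conditioned on both the retrieved passages and the cache of earlier sections; the sketch only mirrors the control flow.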
Related papers
- Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models [11.597314728459573]
We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.
We propose STORM, a writing system for the Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking.
arXiv Detail & Related papers (2024-02-22T01:20:17Z) - Real or Fake Text?: Investigating Human Ability to Detect Boundaries
Between Human-Written and Machine-Generated Text [23.622347443796183]
We study a more realistic setting where text begins as human-written and transitions to being generated by state-of-the-art neural language models.
We show that, while annotators often struggle at this task, there is substantial variance in annotator skill and that given proper incentives, annotators can improve at this task over time.
arXiv Detail & Related papers (2022-12-24T06:40:25Z) - CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z) - Time-aware Prompting for Text Generation [17.58231642569116]
We study the effects of incorporating timestamps, such as document creation dates, into generation systems.
Two types of time-aware prompts are investigated: (1) textual prompts that encode document timestamps in natural language sentences; and (2) linear prompts that convert timestamps into continuous vectors (a rough sketch of both prompt types is given after this list).
arXiv Detail & Related papers (2022-11-03T22:10:25Z) - Cloning Ideology and Style using Deep Learning [0.0]
This research focuses on generating text in the ideology and style of a specific author, including text on topics the author has not written about before.
A Bi-LSTM model makes predictions at the character level; during training, the corpus of the specific author is used alongside a ground-truth corpus.
A pre-trained model identifies ground-truth sentences that contradict the author's corpus, so that the language model can be inclined toward the author's ideology.
arXiv Detail & Related papers (2022-10-25T11:37:19Z) - Mapping Process for the Task: Wikidata Statements to Text as Wikipedia
Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z) - Unsupervised Neural Stylistic Text Generation using Transfer learning
and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only 0.3% of model parameters to learn style-specific attributes for response generation.
We learn style specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - A Survey of Knowledge-Enhanced Text Generation [81.24633231919137]
The goal of text generation is to make machines express themselves in human language.
Various neural encoder-decoder models have been proposed to achieve this goal by learning to map input text to output text.
However, the input text alone often provides limited knowledge for generating the desired output; to address this issue, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models.
arXiv Detail & Related papers (2020-10-09T06:46:46Z)
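To make the "Time-aware Prompting for Text Generation" entry above more concrete, here is a hedged sketch of the two prompt types it describes: a textual prompt that states the creation date in natural language, and a linear prompt that projects the timestamp into a continuous vector prepended to the token embeddings. The class and parameter names (LinearTimePrompt, time_proj, d_model) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the two time-aware prompt types described above.

import torch
import torch.nn as nn


def textual_time_prompt(timestamp: str, document: str) -> str:
    """Textual prompt: state the document creation date in natural language."""
    return f"This document was written on {timestamp}. {document}"


class LinearTimePrompt(nn.Module):
    """Linear prompt: project a scalar timestamp to a continuous vector that is
    prepended to the token embeddings of the document."""

    def __init__(self, vocab_size: int = 1000, d_model: int = 64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.time_proj = nn.Linear(1, d_model)  # scalar year -> d_model vector

    def forward(self, token_ids: torch.Tensor, year: float) -> torch.Tensor:
        t = torch.tensor([[(year - 2000.0) / 25.0]])   # crude normalisation
        time_vec = self.time_proj(t)                   # shape (1, d_model)
        tok_vecs = self.token_emb(token_ids)           # shape (seq_len, d_model)
        return torch.cat([time_vec, tok_vecs], dim=0)  # (seq_len + 1, d_model)


if __name__ == "__main__":
    print(textual_time_prompt("2022-11-03", "A short news article ..."))
    model = LinearTimePrompt()
    out = model(torch.tensor([1, 2, 3]), year=2022.0)
    print(out.shape)  # torch.Size([4, 64])
```

The normalisation constant and embedding sizes are arbitrary; the point is only the structural difference between a string-level prefix and a learned continuous prefix.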
This list is automatically generated from the titles and abstracts of the papers on this site.