HTLM: Hyper-Text Pre-Training and Prompting of Language Models
- URL: http://arxiv.org/abs/2107.06955v1
- Date: Wed, 14 Jul 2021 19:39:31 GMT
- Title: HTLM: Hyper-Text Pre-Training and Prompting of Language Models
- Authors: Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu,
Gargi Ghosh, Luke Zettlemoyer
- Abstract summary: We introduce HTLM, a hyper-text language model trained on a large-scale web crawl.
We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels.
We find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs.
- Score: 52.32659647159799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce HTLM, a hyper-text language model trained on a large-scale web
crawl. Modeling hyper-text has a number of advantages: (1) it is easily
gathered at scale, (2) it provides rich document-level and end-task-adjacent
supervision (e.g. class and id attributes often encode document category
information), and (3) it allows for new structured prompting that follows the
established semantics of HTML (e.g. to do zero-shot summarization by infilling
title tags for a webpage that contains the input text). We show that
pretraining with a BART-style denoising loss directly on simplified HTML
provides highly effective transfer for a wide range of end tasks and
supervision levels. HTLM matches or exceeds the performance of comparably sized
text-only LMs for zero-shot prompting and fine-tuning for classification
benchmarks, while also setting new state-of-the-art performance levels for
zero-shot summarization. We also find that hyper-text prompts provide more
value to HTLM, in terms of data efficiency, than plain text prompts do for
existing LMs, and that HTLM is highly effective at auto-prompting itself, by
simply generating the most likely hyper-text formatting for any available
training data. We will release all code and models to support future HTLM
research.
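As an illustration of the structured prompting described above (zero-shot summarization by infilling title tags), the Python sketch below wraps an input document in simplified HTML and asks a BART-style infilling model to generate the masked <title> element; the generated title serves as the summary. The checkpoint name "facebook/bart-large" and the exact prompt template are stand-ins, not the released HTLM artifacts.

    # Minimal sketch of HTML-structured zero-shot summarization prompting:
    # wrap the input text in a simplified HTML document and let a BART-style
    # infilling model fill in the masked <title> element.
    # Assumption: "facebook/bart-large" is a stand-in checkpoint; the released
    # HTLM weights (and the paper's exact prompt template) may differ.
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

    article = "HTLM is a hyper-text language model trained on a large-scale web crawl ..."

    # The summary is obtained by infilling the masked <title> tag.
    prompt = f"<html><head><title><mask></title></head><body>{article}</body></html>"

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, num_beams=4, max_length=64, early_stopping=True)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

The same HTML interface supports the other uses mentioned in the abstract: class and id attributes can carry label information for classification prompts, and auto-prompting amounts to letting the model generate the most likely hyper-text wrapper for a task's training data.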
Related papers
- Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
- Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification [0.0]
Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks.
We show that smaller, fine-tuned LLMs consistently and significantly outperform larger, zero-shot prompted models in text classification.
arXiv Detail & Related papers (2024-06-12T21:46:13Z)
- 5W1H Extraction With Large Language Models [27.409473072672277]
The extraction of essential news elements through the 5W1H framework is critical for event extraction and text summarization.
ChatGPT has encountered challenges in processing longer news texts and analyzing specific attributes in context.
We design several strategies, ranging from zero-shot/few-shot prompting to efficient fine-tuning, to extract the 5W1H aspects from the original news documents.
arXiv Detail & Related papers (2024-05-25T09:42:58Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
- DOM-LM: Learning Generalizable Representations for HTML Documents [33.742833774918786]
We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
arXiv Detail & Related papers (2022-01-25T20:10:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.