HTLM: Hyper-Text Pre-Training and Prompting of Language Models
- URL: http://arxiv.org/abs/2107.06955v1
- Date: Wed, 14 Jul 2021 19:39:31 GMT
- Title: HTLM: Hyper-Text Pre-Training and Prompting of Language Models
- Authors: Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu,
Gargi Ghosh, Luke Zettlemoyer
- Abstract summary: We introduce HTLM, a hyper-text language model trained on a large-scale web crawl.
We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels.
We find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs.
- Score: 52.32659647159799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce HTLM, a hyper-text language model trained on a large-scale web
crawl. Modeling hyper-text has a number of advantages: (1) it is easily
gathered at scale, (2) it provides rich document-level and end-task-adjacent
supervision (e.g. class and id attributes often encode document category
information), and (3) it allows for new structured prompting that follows the
established semantics of HTML (e.g. to do zero-shot summarization by infilling
title tags for a webpage that contains the input text). We show that
pretraining with a BART-style denoising loss directly on simplified HTML
provides highly effective transfer for a wide range of end tasks and
supervision levels. HTLM matches or exceeds the performance of comparably sized
text-only LMs for zero-shot prompting and fine-tuning for classification
benchmarks, while also setting new state-of-the-art performance levels for
zero-shot summarization. We also find that hyper-text prompts provide more
value to HTLM, in terms of data efficiency, than plain text prompts do for
existing LMs, and that HTLM is highly effective at auto-prompting itself, by
simply generating the most likely hyper-text formatting for any available
training data. We will release all code and models to support future HTLM
research.
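To make the structured-prompting idea concrete, the sketch below frames zero-shot summarization as infilling the title tag of a minimal HTML document, the usage pattern the abstract describes. It uses facebook/bart-large purely as a stand-in BART-style infilling model; the checkpoint name, the prompt template, and the title-extraction step are illustrative assumptions, not the released HTLM pipeline.

```python
# Minimal sketch of hyper-text prompting for zero-shot summarization:
# mask the <title> tag of a simple HTML page and let a BART-style
# denoising model infill it. facebook/bart-large is a stand-in here;
# HTLM itself is trained on simplified HTML with an infilling objective.
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large"  # assumed stand-in checkpoint
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

article = (
    "A hyper-text language model trained on a large-scale web crawl "
    "matches text-only baselines on classification benchmarks while "
    "improving zero-shot summarization."
)

# Hyper-text prompt: the page body carries the input text and the <title>
# tag is masked, so the most likely completion acts as a summary/title.
prompt = (
    f"<html><head><title>{tokenizer.mask_token}</title></head>"
    f"<body><p>{article}</p></body></html>"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=inputs["input_ids"].shape[1] + 32,
    early_stopping=True,
)
reconstruction = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# The infilled title (the zero-shot summary) sits between the <title> tags
# of the reconstructed document, if the model reproduces the markup.
if "<title>" in reconstruction and "</title>" in reconstruction:
    print(reconstruction.split("<title>")[1].split("</title>")[0].strip())
else:
    print(reconstruction)
```

The same masking trick extends to the auto-prompting described above: instead of masking the title, one can mask the HTML scaffolding around raw training examples and let the model generate the most likely hyper-text formatting for them.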
Related papers
- Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
- TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models [14.411646409316624]
We introduce Hierarchical Text-Free Alignment (TS-HTFA), a novel method for time-series forecasting.
We replace paired text data with adaptive virtual text based on QR-decomposed word embeddings and learnable prompts.
Experiments on multiple time-series benchmarks demonstrate that TS-HTFA achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-23T12:57:24Z)
- Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification [0.0]
Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks.
We show that smaller, fine-tuned LLMs consistently and significantly outperform larger, zero-shot prompted models in text classification.
arXiv Detail & Related papers (2024-06-12T21:46:13Z)
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
- TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision [41.05874642535256]
Hierarchical text classification is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing.
Most earlier works focus on fully or semi-supervised methods that require a large amount of human-annotated data, which is costly and time-consuming to acquire.
To reduce human effort, we study hierarchical text classification with minimal supervision: using only the class name of each node as supervision.
arXiv Detail & Related papers (2024-02-29T22:26:07Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)